Contact Recommendation: Effects on the Evolution of Social Networks

UNIVERSIDAD AUTONOMA DE MADRID

ESCUELA POLITECNICA SUPERIOR

MASTER'S THESIS

Double Master's Degree in Computer Engineering and in Research and Innovation in Information and Communication Technologies

Author: SANZ-CRUZADO PUIG, Javier

Advisor: CASTELLS AZPILICUETA, Pablo

Date: February 2017

Summary

Over the last two and a half decades, the development and growth of recommender systems have progressed ever faster. This expansion has led to a confluence between recommendation technologies and other adjacent areas, in particular social network technologies, which have experienced exponential growth in recent years. This work explores one of the most novel problems arising from the confluence of both areas: contact recommendation in social networks.

Our work focuses, on the one hand, on obtaining a comprehensive view of the effectiveness of a wide selection of recommendation algorithms, including some original contributions, and considering novel perspectives that go beyond recommendation accuracy. On the other hand, it focuses on studying the influence that contact recommendation algorithms exert on the evolution of social networks and their properties. A non-negligible fraction of the new links that appear in modern online social networks (such as Twitter, LinkedIn or Facebook) are created through personalized contact suggestions made by the social network platform. Recommender systems are becoming an important factor influencing the evolution of the network. Better understanding this effect, and seizing the opportunity to draw further benefit from the action of recommenders from a broad network perspective, are therefore research directions worth pursuing, which we study here.

Our study comprises theoretical and algorithmic work, including the definition and adaptation of novel evaluation metrics. We complement this with exhaustive experimental work, in which we compare multiple recommendation algorithms developed in different areas, including link prediction, classical recommender systems and information retrieval, along with other algorithms specific to the contact recommendation field. We have evaluated the effects on the evolution of social networks through offline experiments on several graphs from the Twitter social network. We have considered two types of graphs: graphs of interactions between users (retweets, mentions and replies) and explicit friendship graphs (follow relations). With these experiments, we have measured not only the accuracy of the recommenders; we have also studied more novel perspectives, such as the novelty and diversity of the recommendations, and their effects on the structural properties of the network.

Finally, we have analyzed the effects of promoting certain global structural diversity metrics of the recommendations on the flow of information traveling through the networks, in terms of the speed of diffusion and the diversity of the information that users receive.

Keywords: Recommendation, social networks, Twitter, evaluation, novelty, diversity, structural diversity, evolution, diffusion, link prediction.

Abstract

Over the last two and a half decades, the development and expansion of recommender systems have progressed increasingly fast. This expansion has given rise to a confluence between recommendation technologies and other adjacent areas, notably social network technologies, which have similarly experienced exponential growth of their own in the last few years. This thesis explores one of the most novel problems arising from the confluence of both areas: the recommendation of contacts in social networks.

Our work focuses, on the one hand, on gaining a comprehensive perspective of the effectiveness of a wide range of recommendation algorithms, including some of our own original contributions, and considering novel target perspectives beyond recommendation accuracy. On the other hand, it focuses on the study of the influence that contact recommendation algorithms have on the evolution of social networks and their properties. A non-negligible fraction of the new links between pairs of users in modern online social networks (such as Twitter, Facebook or LinkedIn) are created through personalized contact suggestions made by the social network platform. Recommender systems are hence becoming an important factor influencing the evolution of the network. Better understanding this effect, and taking advantage of the opportunity to draw further benefit from the action of recommenders with a broader network perspective, is therefore a worthwhile research direction, which we aim to undertake here.

Our study comprises algorithmic and theoretical work, including the definition and adaptation of novel evaluation metrics. We complement this with extensive experimental work, where we start by comparing multiple recommendation algorithms developed in different areas, including link prediction, classical recommender systems and text information retrieval, along with other algorithms from the contact recommendation field. We have evaluated the effects on the evolution of social networks via offline experiments on several graphs extracted from the Twitter social network. Two different types of graphs have been considered: graphs which represent the different interactions between users (retweets, replies and mentions) and explicit graphs (follow relations). With those experiments, we have measured not only the accuracy of the recommendation algorithms, but also more novel perspectives such as the novelty and diversity of the recommendations, and their effects on the structural properties of the network.

Finally, we have measured the effects of enhancing the structural diversity of the recommendations on the flow of information which travels through the network, in terms of the speed of diffusion and the diversity of the information that users receive.

Keywords

Recommendation, social networks, Twitter, evaluation, novelty, diversity, structural diversity, evolution, information diffusion, link prediction.

Acknowledgements

First of all, I would like to thank my advisor, Pablo Castells, for the opportunity to carry out this work in the Information Retrieval Group, as well as for his constant guidance and support in bringing all of this to completion. I also want to thank everyone who has been part of the group over this last year and a half: Rocío, Sofía, Nacho... Thanks to you, this work has been easier.

I cannot forget all those who kept me company through those master's assignments that seemed endless: Dani, Rafa, Carlos, Noemi, Guido... In particular, among all of them, I would like to thank Rus for the work, effort and support in the face of any problem.

Outside the university, I would like to thank my friends, who have supported me throughout all these years: Nadia, Arabia, Jose... Without you, this would be impossible. Finally, but no less importantly, I would like to thank my whole family for their support and affection throughout my life.

Table of Contents

1. Introduction

1.1 Motivation

1.2 Goals

1.3 Document structure

1.4 Notation

2. State of the art

2.1 Recommender Systems

2.1.1 Recommendation algorithms

2.1.2 Evaluation

2.2 Social Networks

2.2.1 Structural properties of social networks

2.2.2 Communities

2.2.3 Strength of links

2.2.4 Evolution

2.2.5 Link prediction

2.2.6 Diffusion

2.3 Social Recommendation

2.3.1 Item recommendation

2.3.2 Contact recommendation

3. Recommendation Algorithms

3.1 Trivial Recommendation

3.2 Link Prediction Algorithms

3.2.1 Neighborhood-based Methods

3.2.2 Path-based Methods

3.2.3 Random Walk-based Methods

3.3 Twitter Who-To-Follow

3.4 Recommender System Methods

3.4.1 K-Nearest Neighbors

3.4.2 Matrix Factorization

3.5 Text Information Retrieval Methods

3.6 Content-Based Algorithms

4. Evaluation metrics

4.1 Accuracy

4.2 Novelty

4.4 Social Network Structure

4.4.1 Weak ties

5. Experiments

5.1 Data Sets

5.1.1 Preparation of the Data Sets

5.1.2 Attribute Spaces

5.1.3 Description of the Data Sets

5.2 Research Questions

5.2.1 Accuracy Perspective

5.2.2 Other Evaluation Dimensions

5.2.3 Social Network Types

5.3 Software configuration and test environment

5.4 Results

5.4.1 Accuracy

5.4.2 Other evaluation perspectives

5.4.3 Conclusions

6. Information diffusion

6.1 Research Questions

6.2 Structural Diversity Enhancement

6.2.1 Enhancement of global properties

6.3 Metrics

6.3.1 Speed metrics

6.3.2 Diversity metrics

6.4 Experimental configuration

6.4.1 Simulation

6.4.2 Data

6.4.3 Re-rankers

6.5 Results

6.5.1 Speed

6.5.2 Information diversity

6.5.3 Conclusions

7. Conclusions

7.1 Summary and contributions

7.2 Future work

Bibliography

Annex I: Derivations

Contact Recommendation Algorithms

BM25

PurePersonalized PageRank

Metrics

Modularity

Assortativity

Annex II: Complete results

1 Month

Interactions

Follows

200 Tweets

Interactions

Follows

Figure 1. The recommendation task. The scores in red are generated by the recommender system.

Figure 2. Simplified latent variables example in the context of computer games

Figure 3. Train/Test data split in offline experiments

Figure 4. Twitter degree distributions in log-log scale (from Myers et al. 2014)

Figure 5. Triads

Figure 6. Connected components of a directed graph

Figure 7. Contact recommendation provided by Twitter Who-To-Follow system

Figure 8. Resource allocation

Figure 9. Spanning divergent forests for a 3-node cycle graph

Figure 10. Consumer-Producer graph (adapted from Goel et al. 2015)

Figure 11. Probabilistic Matrix Factorization Graphical Model (adapted from Salakhutdinov & Mnih 2007)

Figure 12. Graphs with the same modularity

Figure 13. A tweet with a hashtag

Figure 14. Retweet example

Figure 15. Mention example

Figure 16. Reply example

Figure 17. Sampling example

Figure 18. Louvain community detection algorithm (from Blondel et al. 2008)

Figure 19. Complete 1 Month interactions graph (Colors represent Louvain communities detected on the training graph).

Figure 20. 1 Month follows graph (Colors represent Louvain communities detected on the training graph)

Figure 21. Louvain community sizes (1 Month interactions)

Figure 22. Leading Vector community sizes (1 Month interactions)

Figure 23. Infomap community distribution (1 Month interactions)

Figure 24. Louvain community sizes (1 Month follows)

Figure 25. Leading Vector community sizes (1 Month follows)

Figure 26. Training graphs degree distributions (1 Month)

Figure 27. Infomap community distribution (1 Month follows)

Figure 28. 200 Tweets interactions graph (Colors represent Louvain communities detected on the training graph).

Figure 29. 200 Tweets follows graph (Colors represent Louvain communities detected on the training graph)

Figure 30. Louvain community sizes (200 Tweets interactions)

Figure 31. Leading Vector community sizes (200 Tweets interactions)

Figure 32. Training graphs degree distributions (200 Tweets)

Figure 33. Infomap community distribution (200 Tweets interactions)

Figure 34. Louvain community sizes (200 Tweets follows graph)

Figure 35. Leading Vector community sizes (200 Tweets follows graph)

Figure 36. Infomap community size distribution (200 Tweets follows graph)

Figure 37. Comparison between Interactions 1 month and Follows 1 month algorithm rankings

Figure 38. Comparison between Interactions 200 Tweets and Follows 200 Tweets algorithm rankings

Figure 39. Comparison between Interactions 1 Month and Interaction 200 Tweets algorithm rankings

Figure 40. Comparison between Interactions 1 Month and Interaction 200 Tweets algorithm rankings

Figure 41. P@10 for the neighborhood-based link prediction methods

Figure 42. P@10 for the path-based link prediction methods

Figure 43. P@10 for the random-walk-based link prediction methods

Figure 44. P@10 for the Twitter Who-To-Follow methods

Figure 45. P@10 for the classical recommendation algorithms

Figure 46. P@10 for the adaptations of IR algorithms

Figure 47. P@10 for the content-based algorithms

Figure 48. Different common neighborhoods options

Figure 49. Effects of contact recommendation

Figure 50. Independent top-1 re-ranking for enhancing community in-Gini example. Black arrows represent the existing links and red arrows the recommendation links.

Figure 51. Community in-Gini global top-1 re-ranker example (λ=1)

Figure 52. Twitter model diffusion links example

Figure 53. Push-Pull model diffusion links example. In both steps, links have been selected randomly from the adjacent nodes. More configurations are possible.

Figure 54. Clustering Coefficient and P@10 values for Inverse Clustering Coefficient Rerankers

Figure 57. Diffusion speed (Twitter protocol)

Figure 58. Diffusion speed (Push-Pull protocol)

Figure 59. Degree distribution for the expanded graph for the Random algorithm

Figure 60. Degree distribution for the extended graph for Popularity algorithm

Figure 61. Diffusion speed for the different ImplicitMF re-rankers (Twitter protocol)

Figure 62. Diffusion speed for the different ImplicitMF re-rankers (Push-Pull protocol)

Figure 63. Hashtag-Global Gini results for the different rerankers (Twitter protocol)

Figure 64. Hashtag-User Gini results for the different rerankers (Twitter protocol)

Figure 65. Hashtag-Global Gini results for the different rerankers (Push-Pull protocol)

Figure 66. Hashtag-User Gini results for the different rerankers

Table Index

Table 1. Interactions 1 Month partition

Table 2. 1 Month follows partition

Table 3. Training set metrics (1 Month)

Table 4. Community metrics (1 Month interactions)

Table 5. Community metrics (1 Month follows)

Table 6. 200 Tweets interactions partition

Table 7. 200 Tweets follows partition

Table 8. Training set metrics (200 Tweets)

Table 9. Community metrics (200 Tweets interaction graph)

Table 10. Community-based metrics (200 Tweets follow graph)

Table 11. P@10 for the best and worst algorithms (Interactions 1 Month)

Table 12. P@10 for the best and worst algorithms (Follows 1 Month)

Table 13. P@10 for the best and worst algorithms (Interactions 200 Tweets)

Table 14. P@10 for the best and worst algorithms (Follows 200 Tweets)

Table 15. P@10 comparison for the possible neighborhood selections of different recommendation algorithms (1 Month interactions)

Table 16. P@10 comparison for the possible neighborhood selections of different recommendation algorithms (1 Month follows)

Table 17. P@10 comparison for the possible neighborhood selections of different recommendation algorithms (200 Tweets interactions)

Table 18. P@10 comparison for the possible neighborhood selections of different recommendation algorithms (200 Tweets follows)

Table 19. Highly correlated metrics

Table 20. Selection of some of the most interesting metrics and algorithms (Interactions 1 Month)

Table 21. Selection of some of the most interesting metrics and algorithms (Follows 1 Month)

Table 22. Selection of some of the most interesting metrics and algorithms (Interactions 200 Tweets)

Table 23. Selection of some of the most interesting metrics and algorithms (Follows 200 Tweets)

Table 24. Comparison of the different algorithms in terms of P@10, R@10 and nDCG@10 (1 Month interactions graph)

Table 26. Comparison of the different algorithms in terms of novelty and diversity (1 Month interactions graph) (2 of 2)

Table 27. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (1 Month interactions graph) (1 of 2)

Table 28. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (1 Month interactions graph) (2 of 2)

Table 29. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (1 Month interactions graph) (1 of 2)

Table 30. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (1 Month interactions graph) (2 of 2)

Table 31. Comparison of the different algorithms in terms of community Gini (1 Month interactions graph) (1 of 2)

Table 32. Comparison of the different algorithms in terms of community Gini (1 Month interactions graph) (2 of 2)

Table 33. Comparison of the different algorithms in terms of P@10, R@10 and nDCG@10 (1 Month follows graph)

Table 34. Comparison of the different algorithms in terms of novelty and diversity (1 Month follows graph) (1 of 2)

Table 35. Comparison of the different algorithms in terms of novelty and diversity (1 Month follows graph) (2 of 2)

Table 36. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (1 Month follows graph) (1 of 2)

Table 37. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (1 Month follows graph) (2 of 2)

Table 38. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (1 Month follows graph) (1 of 2)

Table 39. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (1 Month follows graph) (2 of 2)

Table 40. Comparison of the different algorithms in terms of community Gini (1 Month follows graph) (1 of 2)

Table 41. Comparison of the different algorithms in terms of community Gini (1 Month follows graph) (2 of 2)

Table 42. Comparison of the different algorithms in terms of P@10, R@10 and nDCG@10 (200 Tweets interactions graph)

Table 43. Comparison of the different algorithms in terms of novelty and diversity (200 Tweets interactions graph) (1 of 2)

Table 44. Comparison of the different algorithms in terms of novelty and diversity (200 Tweets interactions graph) (2 of 2)

Table 45. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (200 Tweets interactions graph) (1 of 2)

Table 46. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (200 Tweets interactions graph) (2 of 2)

Table 47. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (200 Tweets interactions graph) (1 of 2)

Table 48. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (200 Tweets interactions graph) (2 of 2)

Table 49. Comparison of the different algorithms in terms of community Gini (200 Tweets interactions graph) (1 of 2)

Table 50. Comparison of the different algorithms in terms of community Gini (200 Tweets interactions graph) (2 of 2)

Table 51. Comparison of the different algorithms in terms of P@10, R@10 and nDCG@10 (200 Tweets follows graph)

Table 52. Comparison of the different algorithms in terms of novelty and diversity (200 Tweets follows graph) (1 of 2)

Table 53. Comparison of the different algorithms in terms of novelty and diversity (200 Tweets follows graph) (2 of 2)

Table 54. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (200 Tweets follows graph) (1 of 2)

Table 55. Comparison of the different algorithms in terms of clustering coefficient, embeddedness and distance (200 Tweets follows graph) (2 of 2)

Table 56. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (200 Tweets follows graph) (1 of 2)

Table 57. Comparison of the different algorithms in terms of assortativity, modularity and weak ties (200 Tweets follows graph) (2 of 2)

Table 58. Comparison of the different algorithms in terms of community Gini (200 Tweets follows graph) (1 of 2)

Table 59. Comparison of the different algorithms in terms of community Gini (200 Tweets follows graph) (2 of 2)

1. Introduction

1.1 Motivation

The information that the average citizen can access in the different aspects of daily life has grown to a massive scale. The difficulty of handling this information manually has motivated the growth of personalized recommendation technologies that help users discover products or content of value to them. Taking into account the individual preferences of each user, recommender systems filter the available information and select those items that, according to their predictions, the user might be interested in.

Recommender systems have been in development since the early 90s, and their development and expansion have progressed increasingly fast in the last few years. Initially, these systems were mainly oriented to e-commerce; today they are pervasive in the most varied areas, well beyond the best-known examples such as Amazon (a pioneering company in the field), eBay or Walmart. More recently, recommender systems have been integrated in virtually every domain, such as news (Google News), audiovisual content (Netflix, Spotify, YouTube), personalized advertising (Google AdSense), or software and app stores (Google Play, Steam).

This expansion has given rise to a confluence between recommendation technologies and other adjacent areas, notably social network technologies, which have similarly experienced exponential growth in the last few years. This thesis explores one of the most novel problems arising from the confluence of both areas: the recommendation of contacts in social networks. This problem has a special characteristic with respect to classical recommendation tasks, where items and users are separate objects: here, the items to recommend are chosen from the set of users, and the recommender has additional information available, such as the structure of the links and interactions between the users in the network.

We pursue several goals. On the one hand, we aim to survey, study and analyze the state of the art in the field of contact recommendation in social networks. On the other hand, we explore the definition and implementation of new algorithms, and compare their effectiveness and properties with those of algorithms previously documented in the literature for recommending contacts. Finally, we provide new perspectives for the evaluation of link recommendation in social networks, related to the novelty and diversity of the recommendations, as well as to their collective benefit.

A specific perspective of the present work is the study of the influence that contact recommendation algorithms have on the evolution of social networks and their properties. A large fraction of the new links between pairs of users in social networks like Twitter, LinkedIn or Facebook are created through personalized contact suggestions made by the social network platform, so recommender systems are becoming an important factor influencing the evolution of the network and its properties. Better understanding this effect, and taking advantage of the opportunity to draw further benefit from the action of recommenders with a broader network perspective, is therefore a worthwhile research direction, which we aim to undertake here. The properties of networks can be studied from several perspectives, such as those developed in the field of social network analysis: density, degree distribution, distances, clustering coefficient, modularity, strength of the links, behavior in propagation phenomena, etc. (Newman et al. 2010, Easley et al. 2010). The connection between the large set of measures provided by social network analysis and the effect that recommendation may have on them provides a research opportunity which has not been widely explored yet.

1.2 Goals

The main objective of the present work is to measure the effects of several contact recommendation algorithms on the evolution of social networks. This main objective is subdivided into the following goals:

• Reproduce and compare previously documented algorithms in the context of user recommendation in social networks, and adapt others which have not yet been applied in this context.

• Differentiate between explicit networks (follow networks in Twitter) and interaction networks (retweets, replies and mentions in Twitter), and analyze whether the most effective algorithms are the same in both scenarios.

• Explore and analyze the meaning and utility of novelty and diversity metrics in the context of contact recommendation.

• Analyze the effect of edge directionality on the effectiveness of the algorithms. Traditionally, algorithms documented in the literature have focused on undirected graphs. In this thesis, we test and analyze the behavior of algorithm variants for directed graphs, which may consider different directions for the edges.

• Better understand the effects of recommendations on the global evolution of social networks, and apply this understanding to the different algorithms and recommendation strategies so that properties which may be desirable in networks can be optimized. Several novel perspectives which go beyond the accuracy of the recommendations are considered, such as novelty and diversity metrics, as well as global properties of the networks, with the goal of improving their characteristics as a whole. We consider social networks as dynamic entities which evolve under different influences, among which recommender systems might play an important role.

1.3 Document structure

The present work is divided into seven chapters and two annexes, which are detailed next:

• Chapter 1. Introduction: Motivation and goals of the present work. The notation used in the rest of the document is also described here.

• Chapter 2. State of the art: A review of the basic concepts and previous work in the different areas that this work covers. We follow two directions: recommender system methods and evaluation techniques, with a focus on the particular case of social recommendation, and social network analysis techniques.

• Chapter 3. Recommendation Algorithms: We thoroughly describe the different recommendation algorithms we use in our research.

• Chapter 4. Evaluation Metrics: We introduce the different evaluation perspectives we use for comparing the recommendation algorithms in our experiments, and explain in detail the metrics associated with each perspective.

• Chapter 5. Experiments: We exhaustively compare and analyze the effectiveness of several recommendation algorithms in terms of accuracy, novelty, diversity and a novel perspective known as structural diversity, which measures the effects of the recommendation algorithms on the properties of social networks.

• Chapter 6. Information Diffusion: We analyze the effects of the structural diversity metrics on the speed and diversity of the information which flows through the network.

• Chapter 7. Conclusions: We summarize the contributions of the present document, and propose several research lines to further explore the contact recommendation problem in social networks.

• Annex I. Derivations: Mathematical derivations of several new algorithms and metrics.

• Annex II. Complete Experimental Results: Complete results of the comparison of contact recommendation algorithms in terms of accuracy, novelty, diversity and structural diversity.

1.4 Notation

𝒰 Set of users of the social network graph.

𝐸 Set of edges of the social network graph.

𝐸_{𝑡𝑟𝑎𝑖𝑛} Edges in the training partition of the graph.

𝐸_{𝑡𝑒𝑠𝑡} Edges in the test partition of the graph.

𝐴_𝑖𝑗 Element in the 𝑖-th row and the 𝑗-th column of the adjacency matrix of a network.

𝒵 Set of aspects of the nodes (for diversity metrics).

𝒞 Set of communities of the graph.

Γ(𝑢) Set of neighbors of user 𝑢. This notation may refer to any choice of edge direction.

Γ_𝑖𝑛(𝑢) Set of incoming neighbors of user 𝑢: the users who follow 𝑢 or have interacted with 𝑢.

Γ_𝑜𝑢𝑡(𝑢) Set of outgoing neighbors of user 𝑢: the users whom 𝑢 follows or has interacted with.

Γ_𝑢𝑛𝑑(𝑢) Union of the incoming and outgoing neighbors of user 𝑢.

|𝑋| Number of elements in the set 𝑋.

𝑓_𝑢(𝑣) Recommendation score of candidate user 𝑣 for target user 𝑢.

ℛ(𝑢) Set of contacts recommended to user 𝑢.
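The neighborhood notation above can be illustrated with a minimal Python sketch. The function names `build_neighborhoods` and `gamma_und` and the toy edge list are illustrative assumptions, not part of the thesis software; an edge (u, v) is read as "u follows or has interacted with v".

```python
from collections import defaultdict


def build_neighborhoods(edges):
    """Build Γ_in and Γ_out from a list of directed edges (u, v).

    Hypothetical helper: an edge (u, v) means that u follows v
    or has interacted with v.
    """
    gamma_in = defaultdict(set)
    gamma_out = defaultdict(set)
    for u, v in edges:
        gamma_out[u].add(v)  # v is an outgoing neighbor of u
        gamma_in[v].add(u)   # u is an incoming neighbor of v
    return gamma_in, gamma_out


def gamma_und(user, gamma_in, gamma_out):
    """Γ_und(u): union of the incoming and outgoing neighbors of u."""
    return gamma_in.get(user, set()) | gamma_out.get(user, set())


# Toy directed graph (assumed data, for illustration only).
edges = [("a", "b"), ("b", "a"), ("c", "a"), ("a", "d")]
gamma_in, gamma_out = build_neighborhoods(edges)
und_a = gamma_und("a", gamma_in, gamma_out)
```

In this toy graph, |Γ_und(a)| is smaller than |Γ_in(a)| + |Γ_out(a)| because user b appears in both directions, which is why the union is taken over sets rather than by adding the two degrees.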

2. State of the art

The present work studies the contact recommendation problem in social networks. Contact recommendation lies at the confluence of work in recommender systems and social network analysis. In this chapter, we provide a general overview of the most relevant work in these two fields that is directly related to the goals of our research.

Note that this chapter gives only a brief account of the algorithms, metrics and techniques used in our research; they are described in detail in the following chapters of this document.

2.1 Recommender Systems

Recommender systems started to be conceived and developed in the early 90s, and their penetration in everyday applications has accelerated in recent years. In their beginnings, these systems were mainly oriented to e-commerce; today, recommendation technologies are integrated in the most diverse domains, including online shopping (Amazon, eBay, Walmart, Fnac, etc.), news (Google News), music and video streaming (Netflix, Spotify, YouTube), personalized advertising (Google AdSense), or app stores (Google Play, Steam). The development of these systems is a multidisciplinary field, which draws on elements of Artificial Intelligence, Human-Computer Interaction, Data Mining, Statistics, Marketing and Consumer Behavior.

Figure 1. The recommendation task. The scores in red are generated by the recommender system.

Recommender systems are tools which aim to suggest items to users according to their preferences or needs. To this end, they seek to predict the utility of the items for the user. The recommendation task is defined as follows: the system observes a set of users interacting with a set of items. These observations can be recorded in the form of explicit ratings (e.g. the typical 5-star convention used on Amazon, Netflix or Google Play, or the binary like/dislike feedback used on Facebook, Instagram or Steam) or implicit feedback (the number of times a user interacts with an item, by playing, clicking, buying, etc.).

User-item feedback can be seen as a user-item matrix like the one in Figure 1, where some cells contain observed data (e.g. a rating value) and most do not (the system did not observe any interaction between those user-item pairs). In this view, the recommender’s task is to generate a score 𝑓_𝑢(𝑖) for each user-item pair for which no observation was recorded in the matrix. Based on these scores, for each user (called the target user in this context) the system ranks the different items in descending order of score. This is exemplified in Figure 1, which shows a recommendation for a single user.
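The ranking step described above can be illustrated with a minimal sketch. The function name `recommend`, the toy scores and the tie-breaking rule are assumptions made for illustration only; any scoring function 𝑓_𝑢(𝑖) could be plugged in.

```python
from typing import Callable, Dict, List, Set


def recommend(user: str,
              items: Set[str],
              observed: Dict[str, Set[str]],
              score: Callable[[str, str], float],
              k: int = 10) -> List[str]:
    """Return the top-k items without an observed interaction, by descending score."""
    candidates = items - observed.get(user, set())
    # Sort by descending score; ties are broken by item id so the output is deterministic.
    ranked = sorted(candidates, key=lambda item: (-score(user, item), item))
    return ranked[:k]


# Hypothetical toy data: the dictionary stands in for the scoring function f_u(i).
toy_scores = {("u1", "a"): 0.9, ("u1", "b"): 0.2, ("u1", "c"): 0.7, ("u1", "d"): 0.4}
observed = {"u1": {"a"}}  # u1 already interacted with item "a"
ranking = recommend("u1", {"a", "b", "c", "d"}, observed,
                    lambda u, i: toy_scores[(u, i)], k=2)
```

Item "a" is excluded because an interaction was already observed for it, so only the remaining items compete for the top positions.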

2.1.1 Recommendation algorithms

Many recommendation algorithms have been developed since the field took off (Adomavicius et al. 2005, Ricci et al. 2015). Traditionally, recommender systems have been classified in three categories according to the type of input data taken and how it is processed (Adomavicius et al. 2005): content-based methods, collaborative-filtering methods and hybrid methods.

Content-based methods assume that users are prone to liking items similar to the ones they liked in the past (Adomavicius et al. 2005). These methods analyze the set of items rated by a user and generate a profile for that user in terms of the features that describe the items (for example, in a film recommender system, the genres of the films, the director, the actors, etc.). Then, the utility or relevance of an item is computed as a function of the similarity between the features of the item and the user profile. Two limitations have been identified for these approaches. First, if two items have the same features, they are completely indistinguishable for the recommender system. Second, the content-based approach may produce an overspecialization of the recommended items: since only items similar to the ones previously consumed are recommended, the recommendation will not encourage the user to discover items different from the ones already experienced.

Collaborative filtering algorithms (Goldberg et al. 1992) are considered the most popular and widely implemented recommendation strategies (Ricci et al. 2015). These collaborative approaches use the ratings provided by other users to predict the relevance of the items for a certain user. Two different subfamilies of collaborative filtering algorithms are commonly distinguished: neighborhood-based and model-based algorithms (Adomavicius et al. 2005, Koren et al. 2009). Neighborhood-based algorithms (also known as memory-based or heuristic algorithms) generate recommendations according to the ratings associated with a certain neighborhood of the user or the item. The neighborhood is a set of users (or items) with similar ratings on some of the same items (or by some of the same users) to the ones of the target user (or the candidate item).

Model-based methods build a processed representation (a model) from the raw rating data, and produce recommendations using the model. A particularly successful family of model-based algorithms are the ones based on so-called latent factors, which seek to characterize both users and items on a common latent space inferred from the rating patterns (Koren et al. 2009). For the items, each factor captures some latent characteristic of the items. For example, in the case of computer games, one latent variable could represent the difficulty of the game for average users. In real applications, the latent factors rarely have such a clear interpretation in terms of the items, but examples like the previous one give an intuition of how these approaches work. For users, the elements of the vector measure to what extent the user is interested in items which have high values in the corresponding factor. A simple example in the context of computer games is shown in Figure 2.
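As a rough illustration of the latent factor intuition, the sketch below scores items by the inner product of user and item vectors. The factor values, the two factor meanings and all names are hypothetical; in a real system the vectors would be learned from the ratings rather than written by hand.

```python
from typing import Dict, List


def latent_score(user_vec: List[float], item_vec: List[float]) -> float:
    """Score of an item for a user: inner product of their latent vectors."""
    return sum(p * q for p, q in zip(user_vec, item_vec))


# Hypothetical factors (assumed for illustration): [difficulty, amount of action].
user_factors: Dict[str, List[float]] = {"alice": [0.9, 0.1], "bob": [0.1, 0.8]}
item_factors: Dict[str, List[float]] = {"hard_puzzle": [1.0, 0.0],
                                        "casual_shooter": [0.1, 0.9]}


def rank_items(user: str) -> List[str]:
    """Rank all items for the user by descending latent score."""
    scores = {item: latent_score(user_factors[user], vec)
              for item, vec in item_factors.items()}
    return sorted(scores, key=lambda item: (-scores[item], item))
```

A user whose vector is large in the first factor is ranked items with a large first coordinate first, which is the behavior the text describes.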

The recommendations generated by collaborative filtering overcome some of the limitations of content-based techniques (Ning et al. 2015). In particular, collaborative filtering techniques do not necessarily recommend items similar to the ones previously consumed by the target user, so the recommender system is less likely to overspecialize.

The weak point of collaborative filtering is data sparsity: algorithms take solely as input the ratings given by the users, so new items (with no ratings) cannot be recommended by these methods. Content-based approaches can be a good alternative in such cases.

Figure 2. Simplified latent variables example in the context of computer games

Hybrid algorithms, as their name indicates, combine elements from both content-based and collaborative filtering approaches (Adomavicius et al. 2005). These algorithms are created to overcome the disadvantages of both kinds of algorithms. They can be created in many ways, such as combining the outcomes of several recommenders of each one of the types, adding content-based characteristics to a collaborative filtering method, adding collaborative filtering characteristics to a content-based approach, or creating a general unifying model which combines both characteristics.

2.1.2 Evaluation

The development of recommendation algorithms goes hand in hand with their evaluation, which checks their quality and utility and compares the properties of the different approaches that can be used for the recommendation. Evaluation is performed by running several tests with the algorithms we want to compare, on real or simulated data. According to the experimental configuration of those tests, we can classify the evaluation experiments in three different types (Shani et al. 2015): offline experiments, user studies and online experiments.

Offline evaluation checks the performance of recommender systems using a pre-collected data set of users choosing or rating items. These experiments assume that the behavior of the users in the collected data will be similar to the one that users of the final system will exhibit (Shani et al. 2015). Since they do not require interaction with users, it is easy to compare the effectiveness of a wide range of algorithms. However, results may differ from the real ones, since the data set may be biased, and offline experiments do not allow evaluators to obtain direct feedback from users about the performance of the system.

This type of evaluation simulates the online process where the system makes recommendations to users, and the users select or rate the items they consider appropriate. To do so, the whole set of ratings in the data set is partitioned into two disjoint subsets: the training set and the test set. There are many ways to perform this partition: randomly selecting ratings, taking all the ratings created before a given date as the training set, etc. An example is illustrated in Figure 3. The two sets play the role of the ratings known before the recommendation is produced and the ratings observed afterwards, respectively. Recommendation algorithms are run over the ratings in the training set. Then, to check the different properties of the system which define its quality, several metrics are computed by comparing the outcome of these algorithms with the ratings in the test set.

Figure 3. Train/Test data split in offline experiments
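One of the partition strategies mentioned above, the temporal split, can be sketched as follows. The tuple layout `(user, item, rating, timestamp)` and the function name are assumptions for illustration; they do not describe the data format used in this work.

```python
from typing import List, Tuple

# Assumed record layout: (user, item, rating, timestamp).
Rating = Tuple[str, str, float, int]


def temporal_split(ratings: List[Rating],
                   split_time: int) -> Tuple[List[Rating], List[Rating]]:
    """Ratings created before split_time go to training, the rest to test."""
    train = [r for r in ratings if r[3] < split_time]
    test = [r for r in ratings if r[3] >= split_time]
    return train, test


ratings = [("u1", "a", 5.0, 10), ("u1", "b", 3.0, 20),
           ("u2", "a", 4.0, 30), ("u2", "c", 1.0, 40)]
train, test = temporal_split(ratings, split_time=25)
```

A random split would replace the timestamp comparison with a random draw per rating; the temporal variant has the advantage of never training on information that postdates the test ratings.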

User studies specifically recruit a group of users for the purpose of evaluating the system. The users are asked to perform different tasks over the final system, and the actions they make are used to evaluate the recommender system (Shani et al. 2015).

This type of evaluation allows observing and recording the behavior of different users when they interact with the system, as well as identifying how recommender systems influence their behavior. Direct feedback from users can also be obtained from the comments they make while they use the application. However, user studies are expensive to conduct: first of all, a large set of users must be recruited to obtain significant results, and then, depending on the number and the size of the different tasks, the study can take a long time to complete. Also, the selection of participants may introduce biases which are not present in the set of users of the real system.

Finally, online experiments measure the performance of systems in production, in the real setting. Typically, these experiments are run to compare several versions of a system: a small fraction of the traffic to the system is randomly redirected to a different recommendation engine, and the interactions with both systems are recorded and compared. Since the evaluation is done in the real system, the results are the most realistic of all the evaluation experiments. However, the experiments are risky, since irrelevant or bad recommendations provided by the alternative systems may discourage the users from using the real one.

In this work, we will focus only on offline experiments, using different data sets for evaluating different contact recommendation algorithms. As we stated before, in offline experiments, we use several metrics to evaluate different properties of the system. There are many properties which can be studied to determine the quality of a recommender system. The most well-known and developed evaluation perspective is accuracy (Shani et al. 2015). This perspective checks how similar the outcomes of the recommendation algorithms are to the real user preferences. We can distinguish two classes of accuracy measures: rating metrics and ranking metrics.

Rating metrics

When rating metrics are used to evaluate the accuracy of a recommender system, the evaluator measures how close the scores given to the different items by the recommender system are to the real ratings. Several such measures have been defined; they are useful when the recommendation algorithm produces scores in the same range as the ratings. The most well-known ones (Shani et al. 2015) are the Mean Absolute Error (MAE) of the system:

$$\mathrm{MAE} = \frac{1}{|Test|} \sum_{(u,i) \in Test} \left| f_u(i) - r_u(i) \right| \tag{2.1}$$

and the Root Mean Squared Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{|Test|} \sum_{(u,i) \in Test} \left( f_u(i) - r_u(i) \right)^2} \tag{2.2}$$
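Equations 2.1 and 2.2 translate directly into code. In the sketch below, the dictionaries keyed by (user, item) pairs are an assumed representation of the test ratings 𝑟_𝑢(𝑖) and the predicted scores 𝑓_𝑢(𝑖); the values are made up.

```python
import math
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (user, item)


def mae(predicted: Dict[Pair, float], test: Dict[Pair, float]) -> float:
    """Mean Absolute Error over the (user, item) pairs in the test set."""
    return sum(abs(predicted[p] - r) for p, r in test.items()) / len(test)


def rmse(predicted: Dict[Pair, float], test: Dict[Pair, float]) -> float:
    """Root Mean Squared Error over the (user, item) pairs in the test set."""
    return math.sqrt(sum((predicted[p] - r) ** 2 for p, r in test.items()) / len(test))


# Hypothetical ratings and predictions.
test_ratings = {("u1", "a"): 4.0, ("u1", "b"): 2.0, ("u2", "a"): 5.0}
predictions = {("u1", "a"): 3.0, ("u1", "b"): 2.0, ("u2", "a"): 3.0}
```

Because the errors are squared before averaging, RMSE penalizes the single large error on ("u2", "a") more heavily than MAE does.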

Ranking metrics

Ranking metrics take a different perspective, bringing the notion of relevance into play: an item is considered relevant for a user if it satisfies a need. In the case of recommender systems, we consider that an item is relevant if the user likes it. This means that the user assigned the item a positive rating, if such explicit feedback is available, or that the user simply consumed the item, if only implicit feedback is available. The metrics in this scope are adapted from Information Retrieval (IR), where relevance is a central notion (Baeza-Yates et al. 2010).

Some of these metrics, like Precision or Recall (Baeza-Yates et al. 2010), depend only on the number of relevant items in the recommendation ranking, while others, like Normalized Discounted Cumulative Gain (Järvelin et al. 2000), also consider the position of the items in the ranking, giving more importance to relevant items in the top positions of the recommendation ranking.
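To make the difference between both kinds of metrics concrete, the following sketch computes precision, recall and a binary-relevance variant of the normalized discounted cumulative gain at a cutoff k. The logarithmic discount and the binary gain are common choices assumed here; the exact formulations used in this work are given in chapter 4.

```python
import math
from typing import List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in ranking[:k] if item in relevant) / k


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranking[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: relevant items count more the higher they are ranked."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranking[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical ranking and relevance judgments.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
```

Swapping items "b" and "c" in the ranking leaves precision and recall unchanged but increases nDCG, since a relevant item moves closer to the top.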

These are the metrics we have used in our work to evaluate the accuracy of the contact recommendation algorithms. More detailed information about the different ranking metrics used in our research is given in chapter 4.

Beyond accuracy: novelty and diversity

Providing accurate recommendations is very useful for the user, but this is only one among several important dimensions of recommendation utility. Other technical properties of the system such as scalability, robustness or the privacy of the systems should be considered to make for an overall good user experience (Shani et al. 2015).

But even at a more fundamental, conceptual level, further dimensions matter. Since the beginning of the 2000s, two further properties of recommendations have received increasing attention: novelty and diversity (Castells et al. 2015).

The novelty of a system measures the difference between the present and past experiences of the users (Castells et al. 2015). In terms of recommender systems, it measures how different the items suggested by the system are from the ones the user already knows. Two different novelty perspectives have been proposed: the first one, user independent, measures the “anti-popularity” of the recommendations, i.e. how unknown the recommended items are for the different users in the system; the second measures the differences between the recommendation provided to a user and the items the user already knows.

Diversity relates to the differences between the items in recommendation rankings, without considering the past experience of the user. Again, two different perspectives have been studied: first, a local perspective, which measures the distances between the items in each individual recommendation; second, a global perspective, which studies to what extent every item in the system has been recommended.

As far as we know, novelty and diversity have not been applied in the context of contact recommendation. This opens a novel research line which we explore in this work. In chapter 4, we will delve further into both perspectives and their adaptation to the user recommendation task.

2.2 Social Networks

A social network is a set of people or groups of people with some pattern of contacts or interactions between them (Newman 2003). They have been an object of study for different fields like psychology, sociology, biology or statistics. The earliest documented works with explicit notions of social networks were undertaken in the area of social sciences and date back to the last years of the 19th century (Tönnies 1887, Durkheim 1893). The analysis of social networks has many practical uses, such as studying the spread of diseases over a population, understanding how relationships are created, identifying important people in networks, finding latent communities, identifying key connections in the network, predicting social dynamics, or planning marketing campaigns.

The massive transfer of social network information into online platforms starting in the early 2000s opened a whole new horizon for the field, and gave a new meaning to the notion of social network. Platforms like Facebook, Twitter, LinkedIn or Instagram are used every day by hundreds of millions of people worldwide. The availability of visible network data at such an unprecedented scale has multiplied the possibilities for business as much as research, and has boosted the study and exploitation of these networks in the last couple of decades. In this sense, what online social networks are to social network science and technology can be compared to what the Web meant for the information retrieval field.

The relationships between different people in social networks can be mathematically modeled as a graph, 𝐺 = 〈𝒰, 𝐸〉, where the nodes, 𝒰, represent the different individuals, and the edges, 𝐸, represent the relationships between users (Easley et al. 2010). These relations can also be seen as a matrix, 𝐴, known as the adjacency matrix of the graph, where:

$$A_{ij} = \begin{cases} weight(i, j) > 0 & \text{if there is a link between nodes } i \text{ and } j \\ 0 & \text{otherwise} \end{cases} \tag{2.3}$$

The weight of the link between two users, 𝑖 and 𝑗, can represent the mere existence of a link between the nodes (𝑤𝑒𝑖𝑔ℎ𝑡(𝑖, 𝑗) = 1 for unweighted graphs), or some quantitative property associated with the relationship between 𝑖 and 𝑗, such as how strong the relationship is, the number of times user 𝑖 has interacted with user 𝑗, and so forth.

Depending on the nature of the relations between users, we can distinguish two different network types:

• Directed or asymmetric networks: These networks represent relationships where interactions between two individuals do not need to be reciprocated. Examples include hierarchical relationships inside a company, e-mail networks, or follow networks in Twitter or Instagram. These relations are represented as directed edges in the graph model of the network.

• Undirected or symmetric networks: These networks represent relationships like friendship, where the interactions are reciprocal. The interactions are represented as undirected links in the graph model. Online social networks like Facebook or LinkedIn are examples of these networks.

The area of social network analysis is considerably broad and, as noted before, multidisciplinary, and so is the literature. We overview here the work in this area that is most directly relevant for the goals of our research, and to which we will refer throughout the present document.

2.2.1 Structural properties of social networks

One of the main tools of social network analysis consists in the study of the structure of the network graphs. Knowing the structure of the graph is useful for finding the most influential users, determining how the network will evolve, etc. The analysis of the structure of real-world networks has led to the observation of several recurring patterns in their structure: small diameter, skewed degree distribution, etc. (Newman 2010). In this section, we explain the main characteristics of those social networks.

Degree distributions

One of the fundamental properties of a network is the distribution of the vertex degrees. The degree of a vertex in the network is defined as the number of edges that have that vertex as one of their end points. In the case of a directed network, we differentiate the out-degree of the node (the number of outgoing edges) and the in-degree (the number of incoming edges). To study the degree distribution of a network, it is common to plot the values of the degrees against the proportion of the nodes which have that degree, as shown in Figure 4.
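As a small illustration, the sketch below computes the in-degree or out-degree distribution of a directed graph given as an edge list. The function name, the edge-list representation and the toy graph are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def degree_distribution(nodes: Iterable[str],
                        edges: List[Tuple[str, str]],
                        direction: str = "in") -> Dict[int, float]:
    """Map each degree value to the proportion of nodes having that degree."""
    degree = {n: 0 for n in nodes}  # nodes without edges keep degree 0
    for source, target in edges:
        degree[target if direction == "in" else source] += 1
    counts = Counter(degree.values())
    return {d: c / len(degree) for d, c in sorted(counts.items())}


# Hypothetical follow graph: (u, v) means "u follows v".
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a")]
in_dist = degree_distribution(nodes, edges, "in")
out_dist = degree_distribution(nodes, edges, "out")
```

Even in this tiny graph the in-degree distribution is skewed (node "b" collects most incoming links) while the out-degree distribution is flat, which shows why both distributions are studied separately in directed networks.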

In real-world social networks, it is usual to observe right-skewed distributions: most of the nodes have a very low degree, but there is a significant “tail” of the distribution, which corresponds to the nodes with higher degree (Newman 2010). In other words, a few nodes connect to a large fraction of the nodes of the network. Those nodes are known as hubs.

In directed social networks, the same holds for the in-degree and out-degree distributions. As an example, Figure 4, obtained from Myers et al. (2014), shows the in-degree and out-degree distributions of the Twitter graph, as well as the degree distribution of an undirected graph that only contains the links that are reciprocated.

Figure 4. Twitter degree distributions in log-log scale (from Myers et al. 2014)

Average shortest path length

The average shortest path length measures the typical distance between pairs of nodes in the network. It is closely related to the so-called small-world effect, one of the most widely discussed phenomena in social networks: the average distance between pairs of nodes (defined as the average length of the shortest paths between the nodes in each pair) is very small, considering the huge size of the network. This effect was first observed in Milgram’s experiment in the 1960s (Milgram 1967).

Real-world social networks show this phenomenon: for example, the average shortest path length in Twitter in 2012 (a social network with around 175 million users) was around 4.05 steps (Myers et al. 2014), and the Facebook network in 2011 (with 721 million active users) had an average distance of 4.3 steps (Ugander et al. 2011).

Leskovec et al. (2007) also observed that the distance between nodes in networks does not necessarily increase when the network grows. In fact, they observed that many networks reduce their diameter (the maximum distance between two nodes in the graph) as they grow.
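Both the average shortest path length and the diameter can be obtained from breadth-first searches in unweighted graphs. The sketch below is one possible way to compute them; averaging only over reachable pairs is an assumption made here to deal with disconnected graphs, and the adjacency-dictionary representation is illustrative.

```python
from collections import deque
from typing import Dict, List


def bfs_distances(adj: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Shortest path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(adj: Dict[str, List[str]]):
    """Return (average shortest path length, diameter) over reachable pairs."""
    total, pairs, diameter = 0, 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter


# Hypothetical undirected path graph a - b - c - d (each link stored in both directions).
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg_length, diam = path_statistics(path_graph)
```

Running one search per node costs 𝒪(𝑁(𝑁 + 𝐸)), which is why studies on networks with hundreds of millions of nodes typically rely on sampling or approximation rather than on an exact computation like this one.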

Clustering coefficient

The clustering coefficient is one of the simplest and most widely known graph metrics, and it is related to the transitivity of the network: it measures the proportion of transitive triads in the network. In this context, a triad is defined as a set of three nodes which form a path of length two, i.e. if we name the users 𝑢, 𝑣, 𝑤, they form a triad if 𝑢 follows 𝑣 and 𝑣 follows 𝑤. A triad is considered transitive if there is an edge between the starting and ending nodes of the path, i.e. 𝑢 follows 𝑤. Examples of triads are shown in Figure 5.
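The definition above can be turned into a direct count of triads in a directed graph. The sketch below is a straightforward, non-optimized illustration; the successor-set representation and the function name are assumptions.

```python
from typing import Dict, Set


def clustering_coefficient(succ: Dict[str, Set[str]]) -> float:
    """Proportion of paths u -> v -> w (u, v, w distinct) closed by a link u -> w."""
    triads = 0
    transitive = 0
    for u, followees in succ.items():
        for v in followees:
            if v == u:
                continue
            for w in succ.get(v, ()):
                if w == u or w == v:
                    continue
                triads += 1
                if w in followees:  # u also follows w: the triad is transitive
                    transitive += 1
    return transitive / triads if triads else 0.0


# Hypothetical graphs: succ[u] is the set of users that u follows.
open_triad = {"u": {"v"}, "v": {"w"}, "w": set()}
closed_triad = {"u": {"v", "w"}, "v": {"w"}, "w": set()}
mixed = {"a": {"b", "c"}, "b": {"c", "d"}, "c": set(), "d": set()}
```

In the `mixed` graph, the path a -> b -> c is closed by the link a -> c while a -> b -> d is not, so half of the triads are transitive.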

Figure 5. Triads: (a) non-transitive triad; (b) transitive triad

Connected components

A connected component of a network is a maximal subset of the vertices of the network such that there is at least one path from each member of the subset to each of the other members. If the network is directed, we distinguish two types of components: weakly connected components, if we ignore the direction of the edges, and strongly connected components, if we do not. Figure 6 shows an example of a directed graph with two weakly connected components and three strongly connected ones.

Figure 6. Connected components of a directed graph: (a) weakly connected components; (b) strongly connected components

Real-world networks usually have a giant component which contains at least half of the nodes of the network and, most often, over 90% of them (Newman 2010). For example, the giant component in Facebook is estimated to contain 99% of the nodes (Ugander et al. 2011). In directed graphs, this is still true for weakly connected components, but not necessarily for strongly connected ones. As an example, in Twitter, the largest weakly connected component is estimated to contain 92.9% of the users, but the largest strongly connected one only has 68.7% of them (Myers et al. 2014).
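The difference between weakly and strongly connected components can be illustrated with a deliberately naive sketch: the strongly connected component of a node is the set of nodes that are reachable from it and from which it can be reached. This costs one forward and one backward search per component, so it is only meant for small examples; linear-time algorithms such as Tarjan's exist. All names and the toy graph are assumptions.

```python
from typing import Dict, List, Set, Tuple


def reachable(adj: Dict[str, Set[str]], source: str) -> Set[str]:
    """Set of nodes reachable from source (including source itself)."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def connected_components(nodes: List[str],
                         edges: List[Tuple[str, str]],
                         strong: bool) -> List[Set[str]]:
    """Strongly (strong=True) or weakly (strong=False) connected components."""
    succ = {n: set() for n in nodes}
    pred = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    if not strong:
        # Ignoring edge direction: every link can be traversed both ways.
        succ = pred = {n: succ[n] | pred[n] for n in nodes}
    components, assigned = [], set()
    for n in nodes:
        if n not in assigned:
            component = reachable(succ, n) & reachable(pred, n)
            components.append(component)
            assigned |= component
    return components


# Hypothetical directed graph with two weak and three strong components.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "e"), ("e", "d")]
weak = connected_components(nodes, edges, strong=False)
strong = connected_components(nodes, edges, strong=True)
```

Node "c" is attached to the first weak component but forms a strong component on its own, because no path leads from "c" back to "a" or "b".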

2.2.2 Communities

A natural phenomenon which occurs in social networks is the spontaneous, explicit or implicit gathering of people in different groups or communities. Communities in a social network are defined as subsets of nodes with dense connections inside the subset, and sparse connections to nodes outside that subset (Newman 2006). The formation of these communities is often related to homophily biases (McPherson 2001): contacts between similar people occur at a higher rate than contacts between very different people. The similarities and differences between people may be related to their preferences, location, social position, profession, etc.

The detection of communities is one of the most widely studied problems in social network analysis and graph science. Several models and algorithms have been developed for finding and quantifying communities in networks. All these methods, using the structural properties of the graph, seek to find a partition of the network which maximizes the intracommunity interactions and minimizes the intercommunity interactions. As a partition of the network, communities are related to the concept of connected component. However, due to the giant component phenomenon, connected components are highly restrictive with respect to the concept of community, which provides plenty of additional information for the analysis of the network. The quality of a partition is usually evaluated by the so-called modularity of the graph (Newman & Girvan 2004). This measure compares the number of links inside communities with the expected number of such links in a random multigraph where the degrees of the nodes are the same as the ones in the original graph. It is defined as

$$\mathrm{mod}(G) = \frac{\sum_{i,j} \left( A_{ij} - \frac{|\Gamma(i)||\Gamma(j)|}{m} \right) \delta(c_i, c_j)}{m - \sum_{i,j} \frac{|\Gamma(i)||\Gamma(j)|}{m} \, \delta(c_i, c_j)} \tag{2.4}$$

where Γ(𝑖) represents the set of neighbours of node 𝑖, 𝑚 represents the number of links in the network, 𝑐_𝑖 represents the community that node 𝑖 belongs to, 𝛿(𝑐_𝑖, 𝑐_𝑗) = 1 when 𝑐_𝑖 = 𝑐_𝑗 and 𝛿(𝑐_𝑖, 𝑐_𝑗) = 0 otherwise.
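The modularity formula can be sketched in code for an unweighted directed graph. The sketch relies on an interpretation that is an assumption of this example, not something stated in the text: for a pair (𝑖, 𝑗), |Γ(𝑖)| is read as the out-degree of 𝑖 and |Γ(𝑗)| as the in-degree of 𝑗, and the value is set to 0 when the denominator vanishes (e.g. when all nodes are in a single community).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def modularity(edges: Iterable[Tuple[str, str]], community: Dict[str, int]) -> float:
    """Normalized modularity of a partition of a directed, unweighted simple graph."""
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    intra = 0                    # links with both end points in the same community
    out_sum = defaultdict(int)   # total out-degree of each community (assumed reading)
    in_sum = defaultdict(int)    # total in-degree of each community (assumed reading)
    for u, v in edges:
        out_sum[community[u]] += 1
        in_sum[community[v]] += 1
        if community[u] == community[v]:
            intra += 1
    # Sum over same-community pairs of out(i) * in(j) / m, grouped by community.
    expected = sum(out_sum[c] * in_sum[c] for c in out_sum) / m
    denominator = m - expected
    if denominator == 0:
        return 0.0  # assumption: degenerate partition, no normalization possible
    return (intra - expected) / denominator


# Hypothetical graph: two directed triangles, optionally joined by one link.
triangles = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e"), ("e", "f"), ("f", "d")]
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
separated = modularity(triangles, partition)
joined = modularity(triangles + [("c", "d")], partition)
```

With this normalization, a partition in which every link falls inside a community reaches the maximum value of 1, and adding a link between the two communities lowers the score.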

It is known that there is always a partition of a graph which achieves maximum modularity. However, finding such an optimal partition of the network is computationally infeasible: it is an NP-hard problem (Brandes et al. 2008). Many heuristic methods which provide reasonably good results have been developed (Orman et al. 2011). We describe next the main families of such algorithms.

Some algorithms apply a hierarchical divisive approach, based on link centrality measures. These algorithms iteratively remove the edges selected according to a certain measure, until separate components of the graph are obtained. Those components are considered the communities of the graph. Several measures have been used, like the betweenness of the links, i.e. the number of shortest paths in which the link is included (Newman et al. 2004), or the edge clustering coefficient, which is based on the number of triangles to which the edge belongs (Radicchi et al. 2003).

Another approach for obtaining a good partition consists in greedily optimizing the modularity of the graph. The so-called Louvain algorithm (Blondel et al. 2008) or the FastGreedy algorithm proposed by Clauset et al. (2004) are examples of this approach. Clauset et al. proposed a hierarchical agglomerative algorithm which iteratively joins the pair of communities whose combination produces the largest increase in the modularity value. The Louvain method iteratively increases the modularity by moving users to other communities in the graph and maintaining those changes which produce the largest increase in the modularity of the graph.

Other algorithms take advantage of the matrix formulation of a graph to use linear algebra tools, like eigenvalues and eigenvectors. For example, the Leading Eigenvector approach (Newman 2006) reformulates the modularity optimization problem as an eigenvector finding problem using a so-called modularity matrix.

Another family uses tools derived from information theory to estimate the best partition of the network. Infomap (Rosvall et al. 2008) belongs to this family, and finds an optimal partition by minimizing the quantity of information needed to represent a random walk in the network.

Finally, some algorithms simulate diffusion processes in the network to identify communities. For example, the Label Propagation method (Raghavan et al. 2007) assigns a unique label to each node in the network and, iteratively, each node adopts and propagates the label which a majority of its neighbors has adopted. At the end of the process, nodes with the same labels form the different communities of the graph.
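The label propagation idea can be sketched in a few lines. This is a simplified variant, not the exact procedure of Raghavan et al. (2007): the update order is reshuffled in each sweep, a node keeps its current label when that label is already among the most frequent ones in its neighborhood, and the remaining ties are broken at random. These choices, the seed and the sweep limit are assumptions of the example.

```python
import random
from collections import Counter
from typing import Dict, List


def label_propagation(adj: Dict[str, List[str]],
                      seed: int = 0,
                      max_sweeps: int = 100) -> Dict[str, str]:
    """Assign a community label to each node of an undirected graph."""
    rng = random.Random(seed)
    labels = {node: node for node in adj}  # every node starts with a unique label
    order = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(order)
        changed = False
        for node in order:
            if not adj[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[neighbor] for neighbor in adj[node])
            best = max(counts.values())
            top = sorted(label for label, c in counts.items() if c == best)
            if labels[node] not in top:
                labels[node] = rng.choice(top)
                changed = True
        if not changed:
            break  # every node already holds a majority label
    return labels


# Hypothetical graph: two disconnected triangles.
two_triangles = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
                 "d": ["e", "f"], "e": ["d", "f"], "f": ["d", "e"]}
communities = label_propagation(two_triangles)
```

Since labels only travel along links, the two triangles can never share a label, and within each triangle the only stable state is the one where the three nodes agree.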

Yang et al. (2016) provide a comparison of eight different community detection algorithms, including some of the previously mentioned ones, in terms of accuracy and computing time. Comparing the outcomes of these algorithms over artificial networks, they found that the Infomap and Louvain algorithms provide better communities than the rest of the algorithms, even when the proportion of links between communities is greater than 50%. Both algorithms work even better when the graph is large. The Leading Eigenvector algorithm is quickly outperformed by the rest of the algorithms as the number of edges between communities increases. In terms of complexity, Infomap, Label Propagation (𝒪(𝐸)) and Louvain (𝒪(𝑁 log(𝑁))) are the fastest approaches, while Girvan-Newman (Newman et al. 2004) is the slowest of all (𝒪(𝑁𝐸²)). Yang et al. (2016) also confirm these results empirically.

2.2.3 Strength of links

The strength of a tie between two people is defined as a combination of the amount of time spent on the relation, the emotional intensity, the intimacy and the reciprocal services which characterize the link between those people (Granovetter 1973). Strong links represent e.g. ties with family or close friends, while weak ones may represent ties with people you meet at work, shopkeepers in the local market, etc. The advantages and disadvantages of strong and weak ties have been studied since the beginning of the 1970s. One of the most influential theories is the one proposed by Mark Granovetter (1973).

Granovetter hypothesized that contacts maintained via weak ties provide more novel information and resources than the ones maintained through strong ties, playing a major role in the diffusion of information. This is interesting for the analysis of contact recommendation, since recommending weak links to people may have an impact on the novelty and diversity in the flow of information through the network. Granovetter proposed that the novelty of the information and resources comes from a subset of the weak ties called bridges, which provide the only path between two people in the social network. As an additional definition of weak ties, Granovetter also proposed the notion of local bridge: a link whose removal would increase the shortest distance between its two end points by more than one step. This is related to the concept of the redundancy of a link: the number of paths of length 2 between the two connected nodes. A local bridge is therefore a link which has no redundancy.

Granovetter’s weak tie definitions are too restrictive in practice in real social networks. In fact, Granovetter (1973) stated that global and local bridges are only one particular characterization of weak links in terms of structural properties, and that there could be more such links in the network. The giant component phenomenon (described in section 2.2.1) makes the connected component decomposition rather irrelevant, and therefore so are the global bridges connecting the components. Even the notion of local bridge can be made more informative: Easley et al. (2010) propose to generalize the notion of strength of a link in terms of the neighborhood overlap (or embeddedness) of the link, which is computed as:

$$Embeddedness(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} \tag{2.5}$$
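Equation 2.5 is the Jaccard coefficient of the two neighborhoods and can be computed directly from neighbor sets. The sketch below follows the formula literally, so the end points 𝑢 and 𝑣 themselves are counted in the union when they are neighbors of each other; the dictionary of neighbor sets and the toy graph are assumptions.

```python
from typing import Dict, Set


def embeddedness(neighbors: Dict[str, Set[str]], u: str, v: str) -> float:
    """Neighborhood overlap of the link (u, v), following equation 2.5 literally."""
    common = neighbors[u] & neighbors[v]
    union = neighbors[u] | neighbors[v]
    return len(common) / len(union) if union else 0.0


# Hypothetical undirected graph given as neighbor sets.
neighbors = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "e"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": {"b"},
}
overlap_ab = embeddedness(neighbors, "a", "b")  # one common neighbor: "c"
overlap_ad = embeddedness(neighbors, "a", "d")  # no common neighbors
```

The link between "a" and "d" has no common neighbors, so its embeddedness is 0: it has no redundancy and behaves as a local bridge in the sense described above.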

De Meo et al. (2014) proposed a further extension to the concept of global weak tie in terms of its structural properties: they defined as a weak link every edge between two different communities in the graph. Since communities are always restricted to a single connected component, every link between two different components (global bridge) is still considered a weak link in this definition, which makes it a natural extension of that concept.

2.2.4 Evolution

Social networks are highly dynamic objects, which change over time with the arrival of new people and the development of new interactions between existing nodes in the network. Discovering and understanding the mechanisms in the evolution of those networks over time is one of the prominent problems addressed by network science (Liben-Nowell et al. 2003).

Studies of the evolution of social networks roughly follow two main approaches: a) the creation of mathematical models that describe the formation and evolution of the network; and b) the prediction of which edges will form next among the users in the network. Since the second approach, known as link prediction, also admits further interpretations, we will describe it in a separate section. In this section, we briefly recall some of the most interesting evolutionary models in social networks.

As we stated, a common approach for studying the evolution of social networks consists in the formulation of simplified mathematical models that describe the formation of macroscopic structural graph properties arising through the addition of nodes and links, such as skewed degree distributions, high clustering coefficients, or small diameters (Newman 2003). Modelling graphs has many uses (Kumar et al. 2000): many problems may be computationally difficult for general and real graphs, but with a suitable model, we can design, analyze and simulate algorithms under that model instead of trying them over the real networks. The goal is thus for graph models to capture some relevant aspect of the real networks. Furthermore, models may suggest unexpected properties of real graphs which can be verified and exploited.

The simplest model is the so-called random model, proposed by Erdös & Rényi (1959). In this model, starting from a fixed number of nodes, links are created randomly between the different pairs of users. Although it is simple, this model is very limited: it does not allow the addition of new nodes, and the degree distribution of the graph follows a Poisson distribution (which differs from the skewed distributions of social networks).

One of the first, most influential and well-known models is the Preferential Attachment model proposed by Barabási & Albert (1999), which provides a simple explanation for the formation of skewed degree distributions. In this model, new nodes progressively join the network, creating links to other nodes with probability proportional to the degree of those nodes. Another model which explains the skewed degree distribution of real-world networks is the vertex copying model proposed by Kleinberg et al. (1999A). This model states that, if a new user in the network follows someone in the network, it is likely to follow at least a subset of the nodes the followee follows. For each new user, this model selects a node at random, and copies a subset of its outgoing links.
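The preferential attachment mechanism can be simulated with a short sketch. The initial clique of 𝑚 + 1 nodes, the number of links 𝑚 created by each new node and the sampling trick are assumptions chosen for this illustration, not details taken from the original model description.

```python
import random
from typing import List, Tuple


def preferential_attachment(n: int, m: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Grow an undirected graph with n nodes; each new node links to m existing ones."""
    rng = random.Random(seed)
    # Assumed starting point: a small clique of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in this list once per incident link, so a uniform draw
    # from it selects a node with probability proportional to its degree.
    endpoints = [node for edge in edges for node in edge]
    for new_node in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges


graph = preferential_attachment(n=50, m=2)
```

Because well-connected nodes are drawn more often, early nodes tend to accumulate links as the graph grows, which is the "rich get richer" effect behind the skewed degree distributions.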

Leskovec et al. (2007) proposed a model for studying two empirical phenomena which occur in the evolution of real-world networks: the densification of the graph (networks increase their average degree, following a power-law pattern as they grow), and the reduction of the effective diameter of the network. They proposed the forest fire model, which exhibits both features. The idea behind this model is the following: a new node 𝑢 creates a link to an existing node in the network, 𝑣. The latter may know users which are of interest to 𝑢, so user 𝑢 explores the set of followees of node 𝑣, randomly linking to a subset of them. Some of the followers of those new followees may in turn be interesting for the user, so a subset of them is selected as additional followees, and so forth. This process is repeated recursively until no new nodes are discovered.

Finally, another interesting evolutionary model is the one proposed by Leskovec et al. (2008), where a set of new users is introduced in the network according to an “arrival function”, which determines the number of those users. Every new user creates a link to an existing node selected with a probability proportional to the degree of the users (as in the preferential attachment model). Once a link is created, the node which has created it