The method introduced in this chapter aims at filling a methodological gap. Several proposed procedures for network reconstruction combine different steps to estimate the topology and the weights, thus introducing biases in the whole procedure. A first source of bias is encountered when a probabilistic recipe for topological reconstruction is forced to produce a single outcome instead of considering the entire ensemble of admissible configurations. This choice implies a null likelihood of reproducing the actual network. A second source of bias is encountered when the weight structure is deterministically imposed via a recipe like the RAS one. Again, this recipe ensures a zero likelihood of reproducing the real underlying network: the probability of correctly “guessing” all the link weights is null, assuming continuous weights. Here we reconcile the two aspects by providing a recipe that clarifies how weights should be determined once an algorithm for determining the topology of a given network is implemented (be it either probabilistic or deterministic). Notice that the key concept of our approach, i.e. conditional entropy, generalizes traditional approaches which, instead, aim at jointly determining a given network structure: this is immediately seen by rewriting the probability distribution coming from Shannon entropy maximization as in (2.3).

However, although it is clear how the topology of a network should represent a constraint for the weighted configuration, it is not necessarily true that the weighted configuration has an impact on the topology as well. The CReM assumes only the first dependency, and allows the network topology to be unaware of the weighted configuration.

On a more practical level, solving both CReM$_A$ and CReM$_B$ is less complex than solving the DECM. The latter, in fact, requires the solution of a system of $4N$ four-wise dependent equations, while CReM$_A$ only involves a system of $2N$ paired equations. CReM$_B$, on the other hand, despite being formally similar in derivation, does not require solving any system of equations, since the involved parameters are computed via the recipe (2.21). In addition, CReM$_B$ shows better or comparable performance with respect to the alternative methods, both in terms of likelihood and of confidence interval-based comparisons. For this reason, it is the method we recommend choosing.

Chapter 3

Entropy-based approach to missing-links prediction

The work presented in this Chapter is based on the publication [52] by F. Parisi, T. Squartini and G. Caldarelli.

Abstract

Link-prediction is an active research field within network theory, aiming at uncovering missing connections or predicting the emergence of future relationships from the observed network structure. This chapter represents our contribution to the stream of research concerning missing links prediction for binary networks. Here, we propose an entropy-based method to predict a given percentage of missing links, by identifying them with the most probable non-observed ones. The probability coefficients are computed by solving suitably defined null-models over the accessible network structure. Upon comparing our likelihood-based, local method with the most popular algorithms over a set of economic, financial and food networks, we find ours to perform best, as pointed out by a number of statistical indicators (e.g. the precision, the area under the ROC curve, etc.). Moreover, the entropy-based formalism adopted in the present work allows us to straightforwardly extend the link-prediction exercise to directed networks as well, thus overcoming one of the main limitations of current algorithms. The higher accuracy achievable by employing these methods, together with their larger flexibility, makes them strong competitors of available link-prediction algorithms.

3.1 Introduction

Link-prediction is an active research field within network theory, aiming at uncovering missing connections (e.g. in incomplete datasets) or predicting the emergence of future relationships from the observed network structure. Loosely speaking, the missing links prediction problem can be stated by asking the following question: given a snapshot of a network, can the next most-likely links to be established be predicted? Such an issue is relevant in many research areas, such as social networks [44, 55, 10, 35], protein networks [8, 60], brain networks [12], etc.

To this aim, several algorithms have been proposed so far. Overall, “recipes” for link-prediction can be classified as belonging to one of two main classes: similarity-based algorithms or likelihood-based algorithms [46, 75]. Both classes of algorithms output a list of scores to be assigned to non-observed links: while the similarity-based ones may employ local [7], quasi-local [12, 34, 61, 57, 1, 76] or global information [38, 47, 75] (e.g. the nodes degree, the degree of common neighbours and the length of paths connecting any two nodes, respectively), the likelihood-based ones [29, 69, 51] are defined by a likelihood function whose maximization provides the probability that any two nodes are connected. This is usually achieved by assuming that some kind of benchmark information is known and by treating it as a constraint to account for. An alternative classification distinguishes between algorithms employing purely structural information (either binary or weighted [46]) and algorithms making use of some kind of external information as well (e.g. nodes attributes [43]).

This chapter represents our contribution to the stream of research concerning missing links prediction for binary networks. A novel algorithm is proposed, building upon a series of results concerning constrained entropy-maximization [54, 25, 63]. In a nutshell, we advance the hypothesis that the tasks of predicting missing links and reconstructing a given network structure share many similarities that are worth exploring further. The method we propose in the present work makes a first step in this direction, by employing entropy-based null-models to approach the link-prediction problem. As a last remark, we notice that while the problem of missing links prediction is usually associated with the problem of spurious links identification, here we only address the former.

The remainder of the Chapter is organized as follows. In Section 3.2 an overview of the missing links prediction problem is provided, together with a detailed description of the method we propose here. Section 3.3 contains a synthetic description of the datasets used for testing our methods. In Section 3.4, we compare our method with the most common link-prediction algorithms and we comment on the results in Section 3.5.

3.2 Methods

In order to fix the formalism, let us briefly reformulate the link-prediction problem ab initio.

Let us indicate with the symbol $A$ the adjacency matrix of the observed network and with the symbol $E$ the corresponding set of observed links: as a consequence, upon indicating with $U$ the set of all node pairs, $U \setminus E$ will be referred to as the set of non-existent links. In order to fully control a given recipe for link-prediction, the link set is usually partitioned into a training set, $E^T$, and a probe set, $E^P = E \setminus E^T$. The former is used in the “calibration” phase of a given prediction algorithm, while the latter is used for testing it: links belonging to $E^P$ are, in fact, removed from the observed network and play the role of missing links. Let us indicate with $|E^P| \equiv L_{miss}$ the cardinality of the probe set, corresponding to the number of missing links. Naturally, the adjacency matrix is partitioned as well: the portion of it corresponding to the training set will be indicated with the symbol $A^T$. The union of the missing links set and the non-existent links set, $E^N = E^P \cup (U \setminus E) \equiv U \setminus E^T$, will be referred to as the set of non-observed links.
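The partition described above can be sketched in a few lines of Python. This is only an illustrative sketch for undirected links: the function name `split_links`, the `probe_fraction` parameter and the use of a seeded random generator are assumptions introduced here, not part of the original method.

```python
import random
from itertools import combinations

def split_links(nodes, links, probe_fraction=0.1, seed=0):
    """Partition the observed links E into a training set E^T and a probe set E^P.

    Hypothetical helper: undirected links are stored as sorted node pairs.
    """
    rng = random.Random(seed)
    observed = sorted({tuple(sorted(link)) for link in links})   # E
    n_probe = round(probe_fraction * len(observed))               # L_miss
    probe = set(rng.sample(observed, n_probe))                    # E^P
    training = set(observed) - probe                              # E^T = E \ E^P
    all_pairs = set(combinations(sorted(nodes), 2))               # U
    non_existent = all_pairs - set(observed)                      # U \ E
    non_observed = all_pairs - training                           # E^N = U \ E^T
    return training, probe, non_existent, non_observed
```

By construction, the returned set of non-observed links coincides with the union of the probe set and the non-existent links, mirroring the identity $E^N = E^P \cup (U \setminus E)$.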

Link-prediction algorithms output a list of scores to be assigned to non-observed links. Upon indicating with $i$ and $j$ the nodes constituting the extremes of non-observed links, the most traditional recipes are quickly reviewed below. In what follows, we will focus on the algorithms employing either local or quasi-local information.

Link-prediction for undirected networks

• The simplest recipe to define scores is based on the number of common neighbours (CN) of $i$ and $j$

$$s^{CN}_{ij} = |\Gamma(i) \cap \Gamma(j)|; \tag{3.1}$$

• a slightly more elaborate function of it is represented by the Jaccard coefficient (J), which discounts the information encoded into the size of the nodes neighbourhoods:

$$s^{J}_{ij} = \frac{|\Gamma(i) \cap \Gamma(j)|}{|\Gamma(i) \cup \Gamma(j)|} = \frac{s^{CN}_{ij}}{k_i + k_j - s^{CN}_{ij}}; \tag{3.2}$$

• algorithms based on the information provided by nodes degrees exist. The simplest example is provided by the one inspired by the preferential attachment (PA) mechanism, whose generic score reads

$$s^{PA}_{ij} = k_i \cdot k_j; \tag{3.3}$$

• others, instead, are defined by the inverse of some kind of function of the neighbours degree (according to the original Adamic-Adar, AA, prescription or subsequent variations, such as the resource allocation, RA, one)

$$s^{RA}_{ij} = \sum_{l \in \Gamma(i) \cap \Gamma(j)} \frac{1}{k_l}, \qquad s^{AA}_{ij} = \sum_{l \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\ln k_l}; \tag{3.4}$$

• modifications of the aforementioned indices have been recently proposed, encoding information on the link density of the neighbourhood of each pair of nodes. These indices are the so-called CAR-based ones [12] and prescribe to “correct” the scores above by adding a factor $|\gamma(l)|$, counting how many neighbours of node $l \in \Gamma(i) \cap \Gamma(j)$ are also common neighbours of $i$ and $j$. More explicitly

$$s^{CAR}_{ij} = s^{CN}_{ij} \cdot \sum_{l \in \Gamma(i) \cap \Gamma(j)} \frac{|\gamma(l)|}{2}, \tag{3.5}$$

$$s^{CJC}_{ij} = \frac{s^{CAR}_{ij}}{|\Gamma(i) \cup \Gamma(j)|}, \tag{3.6}$$

$$s^{CPA}_{ij} = (e_i + s^{CAR}_{ij}) \cdot (e_j + s^{CAR}_{ij}), \tag{3.7}$$

$$s^{CRA}_{ij} = \sum_{l \in \Gamma(i) \cap \Gamma(j)} \frac{|\gamma(l)|}{k_l}, \tag{3.8}$$

$$s^{CAA}_{ij} = \sum_{l \in \Gamma(i) \cap \Gamma(j)} \frac{|\gamma(l)|}{\ln k_l} \tag{3.9}$$

where $e_i$ indicates the external degree of node $i$, i.e. the number of neighbours of $i$ that are not neighbours of $j$.
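As an illustration of the local scores in eqs. (3.1)-(3.4), the following minimal sketch computes the CN, J, PA, RA and AA indices for a single pair of nodes. The helper names `neighbours` and `local_scores`, and the convention of returning 0 for the Jaccard score when both neighbourhoods are empty, are assumptions made here for the example; the CAR-based indices are not covered.

```python
import math

def neighbours(links):
    """Build the neighbourhood Gamma(i) of every node from an undirected link list."""
    nbrs = {}
    for i, j in links:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    return nbrs

def local_scores(nbrs, i, j):
    """Return the CN, J, PA, RA and AA scores of the node pair (i, j)."""
    gamma_i, gamma_j = nbrs.get(i, set()), nbrs.get(j, set())
    common = gamma_i & gamma_j
    union = gamma_i | gamma_j
    cn = len(common)
    return {
        "CN": cn,                                                  # eq. (3.1)
        "J": cn / len(union) if union else 0.0,                    # eq. (3.2)
        "PA": len(gamma_i) * len(gamma_j),                         # eq. (3.3)
        "RA": sum(1.0 / len(nbrs[l]) for l in common),             # eq. (3.4)
        # A common neighbour of two distinct nodes has degree >= 2, so ln(k_l) > 0.
        "AA": sum(1.0 / math.log(len(nbrs[l])) for l in common),   # eq. (3.4)
    }
```

The second form of eq. (3.2) follows from inclusion-exclusion: the size of the union equals $k_i + k_j - s^{CN}_{ij}$, which is what `len(union)` evaluates to.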

Entropy-based approach to link-prediction

The rationale of our method is based upon the concept of network reconstructability. In other words, provided that the accessible portion $A^T$ of a network is satisfactorily reproduced by a given amount of topological information, it is reasonable to suppose that the latter allows the inaccessible portion to be inferred with reasonable accuracy as well. Invoking the aforementioned concept allows us to rephrase the link imputation problem within the network reconstruction framework, making it possible to employ the techniques developed there.

From a technical point of view, our algorithm is a local, likelihood-based one. It rests upon the information provided by local, topological quantities, which are enforced as constraints of a maximization procedure defined within the Exponential Random Graph (ERG) framework [54, 63]. In the case of binary, undirected networks, constraints are represented by nodes degrees, i.e. $\vec{k}(A^T)$, and the ERG framework leads to the maximization of the likelihood function $\mathcal{L} = \ln P(A^T)$ where

$$P(A^T) = \prod_{i<j} p_{ij}^{a_{ij}} (1 - p_{ij})^{1 - a_{ij}} \tag{3.10}$$

and $p_{ij} = \frac{x_i x_j}{1 + x_i x_j}$. The numerical value of the unknown coefficients $\vec{x}$ is obtained upon solving the system of equations

$$k_i(A^T) = \sum_{j (\neq i)} p_{ij} = \sum_{j (\neq i)} \frac{x_i x_j}{1 + x_i x_j} \quad \forall\, i \tag{3.11}$$

(Section 1.3.1 shows the derivation of the condition above in the directed case; the undirected case represents a straightforward simplification). Our algorithm, which is trained on $A^T$, prescribes to interpret the probability coefficients $\{p_{ij}\}_{ij \in E^N}$ assigned to the non-observed links as scores to carry out the link-prediction: upon sorting the coefficients $\{p_{ij}\}_{ij \in E^N}$ in decreasing order, the first $L_{miss}$ largest ones are naturally interpreted as pointing out the $L_{miss}$ most probable missing links (notice that such a prescription is based on the assumption that the number of missing links is known, although their identity is not: as a consequence, this number is retained). In other words, the reconstructability assumption underlying our method leads us to interpret the non-observed links which have been assigned the largest probability coefficients as the ones that are most likely to appear given the chosen constraints.
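The two steps just described (solving eq. (3.11) on the training network, then ranking the non-observed pairs by $p_{ij}$) can be sketched as follows. The text does not specify a numerical solver: the fixed-point iteration $x_i \leftarrow k_i / \sum_{j \neq i} x_j/(1 + x_i x_j)$ used below is one possible choice, assumed here for illustration, and its convergence is not guaranteed for every degree sequence. The function names, the labelling of nodes as $0, \dots, N-1$ and the tolerance values are likewise assumptions.

```python
from itertools import combinations

def expected_degrees(x):
    """Right-hand side of eq. (3.11): sum over j != i of x_i x_j / (1 + x_i x_j)."""
    n = len(x)
    return [sum(x[i] * x[j] / (1.0 + x[i] * x[j]) for j in range(n) if j != i)
            for i in range(n)]

def solve_ubcm(degrees, tol=1e-10, max_iter=10000):
    """Solve eq. (3.11) by fixed-point iteration (an assumed solver, not the source's)."""
    n = len(degrees)
    total = sum(degrees)
    if total == 0:
        return [0.0] * n
    x = [k / total ** 0.5 for k in degrees]          # Chung-Lu-like initial guess
    for _ in range(max_iter):
        new_x = []
        for i in range(n):
            denom = sum(x[j] / (1.0 + x[i] * x[j]) for j in range(n) if j != i)
            new_x.append(degrees[i] / denom if degrees[i] > 0 else 0.0)
        x = new_x
        residual = max(abs(k - e) for k, e in zip(degrees, expected_degrees(x)))
        if residual < tol:
            break
    return x

def predict_missing_links(n_nodes, training_links, n_missing):
    """Score every non-observed pair with p_ij and return the n_missing largest."""
    training = {tuple(sorted(link)) for link in training_links}
    degrees = [0] * n_nodes
    for i, j in training:
        degrees[i] += 1
        degrees[j] += 1
    x = solve_ubcm(degrees)
    scores = {(i, j): x[i] * x[j] / (1.0 + x[i] * x[j])
              for i, j in combinations(range(n_nodes), 2)
              if (i, j) not in training}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_missing], scores
```

Since $p_{ij}$ is increasing in both $x_i$ and $x_j$, the non-observed pairs ranked first are those connecting the nodes with the largest fitted coefficients, i.e. typically the nodes with the largest training degrees.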

Our recipe has a remarkable, equivalent formulation. In fact, the subset $\Sigma^*$ of $L_{miss}$ links characterized by the largest probability coefficients satisfies

$$\Sigma^* = \underset{\Sigma:\, |E|(\Sigma) = L_{miss}}{\mathrm{argmax}}\; P(\Sigma | A^T) \tag{3.12}$$

with $P(\Sigma | A^T) = \prod_{i<j,\; ij \in E^N} p_{ij}^{\sigma_{ij}} (1 - p_{ij})^{1 - \sigma_{ij}}$. Since the maximum value of such a product is achieved once the $L_{miss}$ largest factors are selected, the generic entry $\sigma^*_{ij}$ obeys the following rule: $\sigma^*_{ij} = 1$ if $ij$ belongs to the set of $L_{miss}$ most probable missing links and $\sigma^*_{ij} = 0$ otherwise; in other words, $\Sigma^*$ is the subgraph with largest probability among the ones with precisely $L_{miss}$ links. In the remainder of the chapter, this approach will be named after the null-model employed to calculate the link scores, i.e. UBCM (Undirected Binary Configuration Model) [63].
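The equivalence stated above can be made explicit with a short worked step. The sketch below assumes that $P(\Sigma | A^T)$ has the same Bernoulli form as eq. (3.10), restricted to the non-observed pairs, and that $0 < p_{ij} < 1$ for every such pair.

```latex
\begin{align}
P(\Sigma | A^T)
  &= \prod_{ij \in E^N} p_{ij}^{\sigma_{ij}} (1 - p_{ij})^{1 - \sigma_{ij}}
   = \prod_{ij \in E^N} (1 - p_{ij}) \cdot
     \prod_{ij \in E^N} \left( \frac{p_{ij}}{1 - p_{ij}} \right)^{\sigma_{ij}}, \\
\ln P(\Sigma | A^T)
  &= \underbrace{\sum_{ij \in E^N} \ln (1 - p_{ij})}_{\text{independent of } \Sigma}
   + \sum_{ij \in E^N} \sigma_{ij} \ln \frac{p_{ij}}{1 - p_{ij}} .
\end{align}
```

Under the constraint $\sum_{ij \in E^N} \sigma_{ij} = L_{miss}$, the second sum is maximized by setting $\sigma_{ij} = 1$ on the $L_{miss}$ pairs with the largest odds $p_{ij}/(1 - p_{ij})$. Since the odds are a strictly increasing function of $p_{ij}$, these are exactly the $L_{miss}$ pairs with the largest probability coefficients.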

Link-prediction for directed networks

Remarkably, our algorithm can be generalized to approach the missing links prediction problem in directed networks as well. It is enough to maximize the likelihood $\mathcal{L} = \ln P(A^T)$ where, now, $P(A^T) = \prod_{i \neq j} p_{ij}^{a_{ij}} (1 - p_{ij})^{1 - a_{ij}}$, by solving the system of equations

$$\begin{cases} k^{out}_i(A^T) = \sum_{j (\neq i)} p_{ij} = \sum_{j (\neq i)} \frac{x_i y_j}{1 + x_i y_j} & \forall\, i \\ k^{in}_i(A^T) = \sum_{j (\neq i)} p_{ji} = \sum_{j (\neq i)} \frac{x_j y_i}{1 + x_j y_i} & \forall\, i \end{cases} \tag{3.13}$$

and consider the coefficients $\{p_{ij}\}_{ij \in E^N}$ as scores to be assigned to the non-observed links (see Section 1.3.1 for the derivation of the condition above). The proper prediction step is still carried out by applying the recipe defined by eq. (3.12), with the only difference that, now, the product runs over the directed pairs of nodes. In the remainder of the chapter, this approach will be named after the null-model employed to calculate the link scores, i.e. DBCM (Directed Binary Configuration Model) [63].
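The directed system (3.13) can be approached in the same spirit as the undirected one. The sketch below uses a fixed-point iteration on the pair $(\vec{x}, \vec{y})$; this solver, the function names and the stopping rule are assumptions made for illustration (the text does not prescribe a numerical scheme), and convergence is not guaranteed in general.

```python
def dbcm_probabilities(x, y):
    """p_ij = x_i y_j / (1 + x_i y_j) for every ordered pair i != j."""
    n = len(x)
    return {(i, j): x[i] * y[j] / (1.0 + x[i] * y[j])
            for i in range(n) for j in range(n) if i != j}

def solve_dbcm(k_out, k_in, tol=1e-10, max_iter=10000):
    """Solve eq. (3.13) by fixed-point iteration (an assumed solver)."""
    n = len(k_out)
    total = sum(k_out)
    if total == 0:
        return [0.0] * n, [0.0] * n
    x = [k / total ** 0.5 for k in k_out]
    y = [k / total ** 0.5 for k in k_in]
    for _ in range(max_iter):
        new_x, new_y = [], []
        for i in range(n):
            d_out = sum(y[j] / (1.0 + x[i] * y[j]) for j in range(n) if j != i)
            d_in = sum(x[j] / (1.0 + x[j] * y[i]) for j in range(n) if j != i)
            new_x.append(k_out[i] / d_out if k_out[i] > 0 else 0.0)
            new_y.append(k_in[i] / d_in if k_in[i] > 0 else 0.0)
        x, y = new_x, new_y
        p = dbcm_probabilities(x, y)
        res_out = max(abs(k_out[i] - sum(p[i, j] for j in range(n) if j != i))
                      for i in range(n))
        res_in = max(abs(k_in[i] - sum(p[j, i] for j in range(n) if j != i))
                     for i in range(n))
        if max(res_out, res_in) < tol:
            break
    return x, y
```

Once $\vec{x}$ and $\vec{y}$ are available, the prediction step is the same as in the undirected case: the coefficients $p_{ij}$ of the non-observed directed pairs are sorted in decreasing order and the first $L_{miss}$ are retained.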

Notice, instead, that no unambiguous ways to generalize traditional scores exist. Here we have adopted the (directed) extensions listed below, with the aim of accounting for link directionality whenever possible:

• when considering directed networks, the concept of common neighbours has to be adapted by distinguishing between “successors” and “predecessors”, i.e. the nodes respectively “pointed by” and “pointing to” a given node. Upon indicating the set of “successors” of $i$ with $\Gamma_S(i)$ and the set of “predecessors” of $j$ with $\Gamma_P(j)$, the CN index can be generalized as follows

$$s^{CN}_{ij} = |\Gamma_S(i) \cap \Gamma_P(j)|; \tag{3.14}$$

• building upon the directed version of the CN index, the Jaccard index reads

$$s^{J}_{ij} = \frac{|\Gamma_S(i) \cap \Gamma_P(j)|}{|\Gamma_S(i) \cup \Gamma_P(j)|} = \frac{s^{CN}_{ij}}{k^{out}_i + k^{in}_j - s^{CN}_{ij}}; \tag{3.15}$$

• the RA and AA indices can be straightforwardly generalized as follows:

$$s^{RA}_{ij} = \sum_{l \in \Gamma(i) \cap \Gamma(j)} \frac{1}{k^{tot}_l}, \qquad s^{AA}_{ij} = \sum_{l \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\ln k^{tot}_l} \tag{3.16}$$

with $k^{tot}_i = k^{out}_i + k^{in}_i$;

• the PA score admits two different generalizations: one employing the total degree of nodes

$$s^{PA'_{I}}_{ij} = k^{tot}_i \cdot k^{tot}_j \tag{3.17}$$

and the other employing the nodes out- and in-degree

$$s^{PA'_{II}}_{ij} = k^{out}_i \cdot k^{in}_j; \tag{3.18}$$

• while the CAR-based indices are not straightforwardly generalizable to the directed case, other scores exist aiming at extending the concept of “closed triad” to account for link directionality [59]. The triadic closure index (TC) is defined as:

$$s^{TC}_{ij} = \sum_{l \in \Gamma(i) \cap \Gamma(j)} w_{i,j,l} \cdot w(l); \tag{3.19}$$

here, the “triad weight” $w_{i,j,l} = \frac{\#T_{i \to j,l} + \#T_{i \leftrightarrow j,l}}{\#T_{i,j,l}}$ is defined by the (global) number $\#T_{i,j,l}$ of observed, open triads of the particular kind $T_{i,j,l}$, the (global) number $\#T_{i \to j,l}$ of observed, closed triads via a directed link from $i$ to $j$ and the (global) number $\#T_{i \leftrightarrow j,l}$ of observed, closed triads via a reciprocal link between $i$ and $j$; $w(l)$ is, instead, a node-specific weight that can be set either to $w(l) = \frac{1}{k_l}$ or to 1. In order to avoid misinterpretations, we set the weight to 1.
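The directed generalizations in eqs. (3.14), (3.15), (3.17) and (3.18) can be illustrated with a minimal sketch. The function name `directed_scores` and the dictionary keys are hypothetical; the RA, AA and TC indices are left out, since they require conventions (the neighbourhood used in the sum, the triad counts) that the example does not attempt to fix.

```python
def directed_scores(links, i, j):
    """Directed CN, J and PA scores for the ordered pair (i, j).

    `links` is a list of directed links (u, v), meaning u -> v.
    """
    successors, predecessors = {}, {}
    for u, v in links:
        successors.setdefault(u, set()).add(v)     # nodes "pointed by" u
        predecessors.setdefault(v, set()).add(u)   # nodes "pointing to" v

    def k_out(n):
        return len(successors.get(n, set()))

    def k_in(n):
        return len(predecessors.get(n, set()))

    succ_i = successors.get(i, set())
    pred_j = predecessors.get(j, set())
    cn = len(succ_i & pred_j)                      # eq. (3.14): two-paths i -> l -> j
    union = len(succ_i | pred_j)
    return {
        "CN": cn,
        "J": cn / union if union else 0.0,         # eq. (3.15)
        "PA_I": (k_out(i) + k_in(i)) * (k_out(j) + k_in(j)),   # eq. (3.17)
        "PA_II": k_out(i) * k_in(j),               # eq. (3.18)
    }
```

With this convention the directed CN score counts the directed two-step paths from $i$ to $j$, so that, unlike its undirected counterpart, it is in general not symmetric under the exchange of $i$ and $j$.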

Testing link-prediction

Once a link-prediction algorithm has been defined, a number of statistical indices exist to test its effectiveness. In what follows we briefly review the ones we have employed in the present work to compare the aforementioned algorithms. The first index we have considered is the true positive rate (also known as precision), defined as

$$\mathrm{TPR} = \frac{L_r}{L_{miss}} \tag{3.20}$$

and quantifying the percentage of missing links that are correctly recovered (i.e. the number $L_r$ of correctly identified missing links within the list of the first $L_{miss}$ links with the largest score). A similar-in-spirit index is the accuracy

$$\mathrm{ACC} = \frac{L_r + L_{ne}}{|E^N|}, \tag{3.21}$$

quantifying the percentage of correctly classified links (i.e. both the $L_r$ missing ones and the $L_{ne}$ non-existent ones) with respect to the total number of non-observed links. The third index we consider is the traditional area under the ROC curve, or AUC, proxied by the number

$$\mathrm{AUC} = \frac{n' + n''/2}{n}; \tag{3.22}$$

$n'$ counts the number of times a missing link is assigned a higher probability than a non-existent one, while $n''$ accounts for the number of times they are assigned an equal probability. The denominator $n$ coincides with the total number of comparisons (i.e. the number of missing links times the number of non-existent links). This index is intended to quantify the probability that any missing link is assigned a score that is larger than the score assigned to any non-existent link. If all scores were i.i.d., the AUC value should be distributed around an expected value of 1/2: therefore, the extent to which the AUC value exceeds 0.5 provides an indication of how much better the algorithm performs than pure chance.

The set of missing links is usually randomly removed: we have followed such a procedure, by 1) randomly removing 10% of the links, 10 times, 2) quantifying the performance of the algorithms above, by computing the three aforementioned indices over each sample, 3) averaging these values over the sample set (the sample standard deviation is used to proxy the estimation error).
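The three indices (3.20)-(3.22) can be computed from a dictionary of scores over the non-observed links and the probe set, as in the sketch below. The function name `evaluate` is hypothetical, and one interpretation is assumed: $L_{ne}$ is taken to be the number of non-existent links that do not appear among the $L_{miss}$ top-ranked ones, which the text implies but does not state explicitly.

```python
def evaluate(scores, probe):
    """Return (TPR, ACC, AUC) for scores over the non-observed links E^N.

    scores: dict mapping every non-observed link to its score.
    probe:  set of missing links E^P (a subset of the keys of `scores`).
    """
    n_missing = len(probe)
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[:n_missing])              # the L_miss top-ranked links
    non_existent = set(scores) - probe               # U \ E

    l_r = len(predicted & probe)                     # missing links correctly recovered
    l_ne = len(non_existent - predicted)             # assumed meaning of L_ne
    tpr = l_r / n_missing                            # eq. (3.20)
    acc = (l_r + l_ne) / len(scores)                 # eq. (3.21)

    higher = ties = 0                                # n' and n'' of eq. (3.22)
    for m in probe:
        for q in non_existent:
            if scores[m] > scores[q]:
                higher += 1
            elif scores[m] == scores[q]:
                ties += 1
    auc = (higher + 0.5 * ties) / (len(probe) * len(non_existent))
    return tpr, acc, auc
```

The AUC is computed here by exhaustive comparison of every missing link with every non-existent one, which matches the definition of $n$ given above but scales as $L_{miss} \cdot |U \setminus E|$; for large networks a sampled estimate may be preferable.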
