Arguably, many nodes in real-world networks do not truly belong to any community. In this thesis, these nodes are termed “background”, as in Wilson et al. (2014). Within the NST framework, the notion of a background node can be formally defined as follows:
Definition 2. Let G be a random network with distribution P. Let a(u, B) be the population association between u and B corresponding to P. Then u ∈ [n] is a background node if and only if a(u, B) = 0 for all B ⊆ [n].
Remark. Note that a(u, B) < 0 would imply that u is negatively associated with B, and is therefore not an example of null behavior. Node sets that are mutually negatively associated are often said to exhibit “disassortative” structure, which, though important in some contexts (Aicher et al., 2014), is less often of scientific interest. Though the approach introduced in this chapter focuses on assortative structure, it also lays the groundwork for extraction of disassortative communities, which could be accomplished by using left-tail rather than right-tail p-values.
A real-world example of a background node in a network otherwise laden with communities is a happenstance friend in a social network. Suppose you travel to a new city and happen to meet and spend some time with a local, or perhaps a fellow traveler. Subsequently, you decide to “friend” this person on an online social platform. It is unlikely that your new friend will have significant connectivity to your existing friend groups, like your college or high school friends, or your work friends. Nonetheless, upon analysis of your social network, the classical partition-based community detection methods discussed in Sections 1.2.1-1.2.4 will force this node into some community.
In contrast, the SCS algorithm is naturally suited to handle background nodes. Recall that, by definition, for any background node u ∈ [n], we have a(u, B) = 0 for all B ⊆ [n]. Thus, for any B ⊆ [n], the statistic T(u, B, D) will follow the null model, hence p(u, B, D) ∼ U[0, 1]. If Uα employs an appropriate multiple testing rule, the proportion of background nodes in Uα(B, G) should therefore be bounded in either expectation or probability, depending on the rule. The classical multiple-testing rule is the well-known Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), defined as follows:
1. Given a set of p-values p := {p_u}_{u∈[n]} and a target FDR α ∈ (0, 1).
2. Calculate the adjusted p-values p*_u := n p_u / j(u), where j(u) is the rank of p_u in p.
3. Compute the threshold τ(p) := max{p_u : p*_u ≤ α}.
Benjamini and Hochberg (1995) show that if the p-values are mutually independent, this procedure ensures that the expected proportion of false discoveries, or in our context, the expected proportion of background nodes in Uα(B, G), is no greater than α.
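The three steps above can be sketched in a few lines of Python. This is only an illustration of the procedure as stated here. The function name `benjamini_hochberg` is hypothetical, ranks are assumed to run from 1 (smallest p-value) upward, and the convention of declaring every p_u ≤ τ(p) a discovery is the standard one rather than something spelled out in the list.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Return a boolean mask of discoveries under the Benjamini-Hochberg rule."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    # j(u): rank of p_u among all p-values (1 = smallest); ties broken by position.
    ranks = np.empty(n, dtype=int)
    ranks[np.argsort(p, kind="stable")] = np.arange(1, n + 1)
    adjusted = n * p / ranks            # p*_u = n * p_u / j(u)
    passing = adjusted <= alpha
    if not passing.any():
        return np.zeros(n, dtype=bool)  # no threshold exists, so nothing is rejected
    tau = p[passing].max()              # tau(p) = max{p_u : p*_u <= alpha}
    return p <= tau                     # assumed convention: reject all p_u <= tau


# Small p-values need not pass individually: 0.04 fails its own
# comparison (3 * 0.04 / 2 > 0.05) but lies below the threshold 0.045.
mask = benjamini_hochberg([0.04, 0.045, 0.01], alpha=0.05)
```

Because the threshold is a maximum over all passing p-values, the rule is a step-up procedure: a p-value whose own adjusted value exceeds α is still a discovery when a larger p-value passes.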
2.4.1 Global Error Control
While use of the multiple-testing rule has direct implications for false discoveries in SCS output sets, its implications for the global error of SCS are not immediately clear. To study this, we define a null random network as follows:
Definition 3. Let G be a random network with distribution P. Let a(u, B) be the population association between u and B corresponding to P. Then G is a null network if and only if a(u, B) = 0 for all u ∈ [n] and B ⊆ [n].
Note that, by Definition 2, a null network G consists entirely of background nodes. We now consider the probability that SCS recovers any stable community in a null network. Ideally, any application of the SCS algorithm on such a network will converge to the empty set. However, this is not guaranteed, due to random fluctuations in the data. Explicitly, let G = ([n], D) be a random network with distribution P. Define the set of stable communities in G as
C(D, α) := {B ⊆ [n] : Uα(B, D) = B}    (2.8)
We define the global Type-I error at level α as P(|C(D, α)| > 0). One key assumption on P is needed to bound the Type-I error at α.
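To make the fixed-point definition concrete, the following sketch enumerates stable communities by brute force on a tiny node set. Everything named here is hypothetical: `pvalue_fn(u, B)` stands in for p(u, B, D), `update` stands in for Uα with the Benjamini-Hochberg rule, and only nonempty sets B are examined (an assumption, since the empty set is the desired output on a null network). Exhaustive enumeration is feasible only for very small n and is not how SCS searches for fixed points.

```python
from itertools import combinations
import numpy as np

def bh_discoveries(pvals, alpha):
    """Boolean mask of Benjamini-Hochberg discoveries (step-up form)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p, kind="stable")
    below = p[order] * n / np.arange(1, n + 1) <= alpha
    mask = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()   # largest rank whose adjusted p-value passes
        mask[order[:k + 1]] = True
    return mask

def update(B, pvalue_fn, n, alpha):
    """Hypothetical stand-in for U_alpha(B, D): nodes whose p-values survive BH."""
    pvals = [pvalue_fn(u, B) for u in range(n)]
    return frozenset(np.flatnonzero(bh_discoveries(pvals, alpha)).tolist())

def stable_communities(pvalue_fn, n, alpha):
    """All nonempty B with update(B) == B, found by exhaustive search."""
    found = []
    for size in range(1, n + 1):
        for nodes in combinations(range(n), size):
            B = frozenset(nodes)
            if update(B, pvalue_fn, n, alpha) == B:
                found.append(B)
    return found


# Toy p-value function: nodes 0 and 1 always look strongly associated.
toy = lambda u, B: 0.001 if u in (0, 1) else 0.9
communities = stable_communities(toy, n=4, alpha=0.05)
```

In this toy setting the update maps every set to {0, 1}, so that set is the only fixed point. A global Type-I error corresponds to the returned list being nonempty when the network is null.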
Assumption 1. For B ⊆[n], denote the set of p-values used by the update Uα by
p(B) :={p(1, B,D), . . . , p(n, B,D)}.
Assume that under P, for all B ⊆ [n], the p-values p(B) are independent and uniformly distributed.
The following theorem establishes that, under Assumption 1, the Type-I error is bounded by α if Uα uses the Benjamini-Hochberg rule. The proof is given in Section A.2.
Theorem 4. (OST global error control)
Fix α ∈ [0, 1] and n > 1. Let G = ([n], D) be a random network with distribution Pn. Assume Pn satisfies Assumption 1. Then if Uα uses the Benjamini-Hochberg multiple testing procedure, P(|C(D, α)| > 0) ≤ α.
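A related fact can be checked numerically: when n p-values are independent and uniform, the Benjamini-Hochberg rule makes at least one discovery with probability α. The Monte Carlo sketch below examines only this single-update property for one fixed set of p-values. It does not simulate the theorem itself, which concerns all sets B simultaneously. The function name, trial count, and seed are arbitrary choices for illustration.

```python
import numpy as np

def null_rejection_rate(n, alpha, trials=20000, seed=0):
    """Fraction of trials in which BH rejects anything under i.i.d. U[0,1] p-values."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated set of n null p-values, sorted ascending.
    p_sorted = np.sort(rng.uniform(size=(trials, n)), axis=1)
    # BH rejects something iff some order statistic satisfies n * p_(j) / j <= alpha.
    thresholds = alpha * np.arange(1, n + 1) / n
    any_rejection = (p_sorted <= thresholds).any(axis=1)
    return any_rejection.mean()


rate = null_rejection_rate(n=50, alpha=0.1)
```

Under these assumptions the estimated rate should land close to the chosen α, up to Monte Carlo error.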
2.4.2 Discussion
Theorem 4 has strong implications. First, note that the theorem makes no reference to the type of network data. It depends only on the existence of a global null model that provides uniform and independent p-values for the node-set test statistic. Second, the theorem depends only on the update Uα(·, G), not on the SCS algorithm overall. Thus, the SCS algorithm should be viewed simply as one way to search for fixed points. Importantly, neither the choice of initialization method for the SCS algorithm nor the method of resolving cycles (discussed at the end of Section 2.3) has an effect on global Type-I error. Theorem 4 concerns the probability that fixed points exist, which always upper-bounds the probability of finding them.
Of course, Theorem 4 has limitations. First, Assumption 1 is almost never satisfied exactly, even when the network is completely null as in Definition 3. Small dependencies between the tests arise due to inherent dependencies in network data. That said, in some simple cases it is easy to see that these dependencies vanish uniformly with the size of the network. Consider a directed binary network G := ([n], A), noting that here A is not symmetric. Without loss of generality we assume the rows of A contain the in-edge indicators of the network. The natural node-to-set association statistic in this setting is then the in-degree of u with respect to B, defined as
$$\tilde{D}(u, B, A) := \sum_{v \in B} A[u, v].$$
Consider that the statistics {D̃(1, B, A), . . . , D̃(n, B, A)} are, in this setting, mutually independent. Therefore, if the p-values p(B) := {p(1, B, A), . . . , p(n, B, A)} were exact, they would also be mutually independent. However, as with the binary-network example laid out in the previous sections, a reasonable null model in this setting depends on the expected degrees of the assumed generative model of G, which are unknown and must be estimated from the data. In particular, each p-value p(u, B, A) is a function of (and only of) D̃(u, B, A) and the observed in-degrees {D̃(1), . . . , D̃(n)}, where D̃(u) := D̃(u, [n], A) for all u ∈ [n]. Thus, for any finite n, the p-values are dependent through the observed degrees.
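As a small illustration of the statistic just defined, the sketch below computes the node-to-set in-degree and the observed in-degrees from a NumPy adjacency matrix. The function names are hypothetical, nodes are indexed from 0 rather than 1, and row u of A is assumed to hold the in-edge indicators of node u, as in the text.

```python
import numpy as np

def in_degree_to_set(A, u, B):
    """Sum of A[u, v] over v in B: in-edges of node u coming from the set B."""
    idx = np.array(sorted(B), dtype=int)   # an empty B gives an empty index and sum 0
    return int(A[u, idx].sum())

def observed_in_degrees(A):
    """In-degree of every node with respect to the full node set [n]."""
    return A.sum(axis=1)


# Toy directed network on 3 nodes; A is not symmetric.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]])
stat = in_degree_to_set(A, 0, {1, 2})      # node 0 receives edges from both 1 and 2
degrees = observed_in_degrees(A)
```

Each statistic reads a different row of A, which is why the statistics for different nodes are independent whenever the rows are. The dependence discussed above enters only through the shared vector of observed in-degrees.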
However, as n approaches infinity, these dependencies vanish uniformly, since (in many standard cases) the observed degrees, suitably normalized, will approach their limiting values. Thus, loosely speaking, the p-values will be asymptotically exact and mutually independent. For instance, suppose that the unknown generative model of G is a directed Erdős-Rényi network with edge probability 0.5. Under this model, D̃(u) is a Binomial(n, 1/2) random variable, and from Bernstein’s Inequality and a union bound it follows that
$$\max_{u \in [n]} \left\{ n^{-1} \left| \tilde{D}(u) - n/2 \right| \right\} \xrightarrow{p} 0 \quad \text{as } n \to \infty.$$
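The Bernstein and union-bound step can be written out as follows. This is a sketch under two assumptions: the convergence is read with the in-degrees normalized by n, and for each u the entries A[u, v], v ∈ [n], are independent Bernoulli(1/2) variables, so that each centered summand is bounded by 1/2 and has variance 1/4.

```latex
% Fix \varepsilon > 0 and a node u. Write
%   \tilde{D}(u) - n/2 = \sum_{v \in [n]} \bigl( A[u,v] - 1/2 \bigr).
% Bernstein's inequality with bound M = 1/2 and total variance n/4 gives
\begin{align*}
P\bigl( |\tilde{D}(u) - n/2| > \varepsilon n \bigr)
  &\le 2 \exp\!\left( - \frac{\varepsilon^2 n^2 / 2}{n/4 + \varepsilon n / 6} \right)
   = 2 \exp\!\left( - \frac{n \varepsilon^2}{1/2 + \varepsilon/3} \right). \\
% A union bound over the n nodes then yields
P\Bigl( \max_{u \in [n]} n^{-1} |\tilde{D}(u) - n/2| > \varepsilon \Bigr)
  &\le 2 n \exp\!\left( - \frac{n \varepsilon^2}{1/2 + \varepsilon/3} \right)
   \longrightarrow 0 \quad \text{as } n \to \infty.
\end{align*}
```

The exponential decay in n dominates the linear factor from the union bound, which is what makes the convergence uniform over nodes.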
In practice, test-wise dependency issues for finite n can potentially be resolved by using more stringent FDR control procedures, like that proposed by Benjamini and Yekutieli (2001). Variants of Theorem 4 that involve these procedures are immediate areas for future work.
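For reference, the Benjamini-Yekutieli procedure amounts to running the same step-up rule at the reduced level α / c(n), where c(n) = 1 + 1/2 + ... + 1/n. The sketch below shows that adjustment next to the plain Benjamini-Hochberg rule. The function names and the `correction` parameter are illustrative choices.

```python
import numpy as np

def step_up(pvals, alpha, correction=1.0):
    """Step-up rule: reject the k smallest p-values, where k is the largest
    rank j with p_(j) <= j * alpha / (n * correction)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p, kind="stable")
    below = p[order] <= np.arange(1, n + 1) * alpha / (n * correction)
    mask = np.zeros(n, dtype=bool)
    if below.any():
        mask[order[:np.nonzero(below)[0].max() + 1]] = True
    return mask

def benjamini_yekutieli(pvals, alpha):
    """BH run at level alpha / c(n), with c(n) the n-th harmonic number."""
    n = len(pvals)
    c_n = np.sum(1.0 / np.arange(1, n + 1))
    return step_up(pvals, alpha, correction=c_n)


pvals = [0.001, 0.008, 0.039, 0.041]
bh_count = int(step_up(pvals, alpha=0.05).sum())              # plain BH
by_count = int(benjamini_yekutieli(pvals, alpha=0.05).sum())  # more stringent
```

On this toy input the harmonic correction reduces the number of discoveries, which illustrates the price paid for validity under dependence.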
Another issue that limits the scope of Theorem 4 is that evaluation of tail probabilities for reasonable node-set test statistics T often involves asymptotic distributional approximations. This means that for finite n and any node set B, p-values will not be precisely uniform under the null. The non-uniformity issue cannot be solved in generality, since every application of SCS involves a different test statistic and, therefore, a different method (approximate or otherwise) of calculating p-values. Thus, Theorem 4 does not strictly guarantee Type-I error control under a global null in any application that involves a distributional approximation. However, empirical Type-I error control for the methods presented in this thesis has been verified in a variety of simulation settings, as will be shown in subsequent chapters.
CHAPTER 3
Continuous Configuration Model Extraction
In this chapter, the Node-Set Testing framework introduced in Chapter 2 is applied to the problem of community detection on weighted networks. The centerpiece of this application is the introduction of a weighted network null model that allows for arbitrary degrees and (separately) arbitrary weighted degrees. Such a model is currently absent from the literature. The null, called the “continuous configuration model”, allows for rigorous statistical tests of graph statistics for weighted networks. In particular, we define an NST test statistic for weighted networks, and apply the continuous configuration model to yield a testing-based extraction method called CCME. The continuous configuration model also serves as a useful tool for simulation. A benchmarking framework introduced in the following sections involves simulated networks with both communities and background nodes simulated under the null. Such networks with both communities and background are crucial for validating the performance of community detection methods on realistic data.
The rest of this chapter is organized as follows. Chapter-specific notation is introduced in Section 3.1. In Section 3.2, the continuous configuration model is motivated and stated. In Section 3.3, the NST test statistic is defined, and theoretical results are given establishing its limiting distribution and asymptotic consistency properties. The overall implementation and application of the core NST algorithm within CCME is described in Section 3.4. Evaluations of CCME’s empirical efficacy on simulations and real data are presented in Sections 3.5 and 3.6 (respectively). A discussion is offered in Section 3.7.