We first define the standard conditional composite likelihood function. For $i = (i_v, v \in V)$, let $X^{(1)}, \dots, X^{(N)}$ be a sample of size $N$ from the distribution of $X$, which belongs to a hierarchical log-linear model $\mathcal{M}_\Delta$. We recall that the global log-likelihood function is
\[
l(\theta) \propto \sum_{i=1}^{N} \log p(X^{(i)}) = \langle \theta, t \rangle - N k(\theta). \tag{4.1.1}
\]
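As a rough illustration of (4.1.1), the following sketch checks numerically that the sum of log-probabilities equals $\langle \theta, t \rangle - N k(\theta)$. It assumes binary variables, so that each $j \in J$ can be identified with its support $S(j)$ and $j \triangleleft i$ reduces to $S(j) \subseteq S(i)$. The graph, the parameter values and the sample are hypothetical and chosen only for the example.

```python
import itertools
import math

# Hypothetical toy model: binary variables on the path graph 0 - 1 - 2.
V = [0, 1, 2]
# With binary data, j is determined by its support S(j); keys are supports.
theta = {
    frozenset({0}): 0.3, frozenset({1}): -0.5, frozenset({2}): 0.8,
    frozenset({0, 1}): 1.2, frozenset({1, 2}): -0.7,
}

def support(cell):
    """S(i): the set of vertices where the cell is non-zero."""
    return frozenset(v for v in V if cell[v] != 0)

def inner(cell):
    """Sum of theta_j over all j whose support is contained in S(cell)."""
    s = support(cell)
    return sum(t for sj, t in theta.items() if sj <= s)

cells = list(itertools.product([0, 1], repeat=len(V)))
# Cumulant generating function k(theta) = -theta_0 (log normalising constant).
k_theta = math.log(sum(math.exp(inner(c)) for c in cells))

# Hypothetical sample of N = 5 observations.
sample = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0)]
N = len(sample)

# Canonical statistic: t_j counts the observations x with S(j) inside S(x).
t = {sj: sum(1 for x in sample if sj <= support(x)) for sj in theta}

loglik_direct = sum(inner(x) - k_theta for x in sample)
loglik_expfam = sum(theta[sj] * t[sj] for sj in theta) - N * k_theta
```

The two quantities `loglik_direct` and `loglik_expfam` agree up to floating-point error, which is the identity in (4.1.1) without the multinomial coefficient.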
For a given vertex $v \in V$, let $N_v$ be the set of neighbours of $v$ in the given graph $G$. The composite likelihood function based on the local conditional distribution of $X_v$ given $X_{V\setminus\{v\}}$ or, equivalently by the Markov property, on the conditional distribution of $X_v$ given its neighbours $X_{N_v}$, is
\[
L^{PS}(\theta) = \prod_{v \in V} L^{v,PS}(\theta), \qquad L^{v,PS}(\theta) = \prod_{i=1}^{N} p(X_v^{(i)} \mid X_{N_v}^{(i)}; \theta), \tag{4.1.2}
\]
where the superscript "PS" stands for "pseudo-likelihood", the name often given to the conditional composite likelihood (Besag (1974)). As given by (2.1.4), for a given cell $i$, we have
\[
\log p(i) = \log p(X_v = i_v, v \in V) = \theta_0 + \sum_{j \triangleleft i} \theta_j
= \theta_0 + \sum_{\substack{j \triangleleft i,\ S(j) \subseteq \{v\} \cup N_v, \\ S(j) \not\subseteq N_v}} \theta_j + \sum_{j \triangleleft i,\ S(j) \subseteq N_v} \theta_j + \sum_{j \triangleleft i,\ S(j) \not\subseteq \{v\} \cup N_v} \theta_j .
\]
Let
\[
J_v^{PS} = \{ j \in J \mid S(j) \subseteq \{v\} \cup N_v,\ S(j) \not\subseteq N_v \} = \{ j \in J \mid v \in S(j) \}.
\]
Next we show that the elements of the set $J_v^{PS}$ index the parameters in the $v$-th component of the conditional likelihood function, i.e. $p(X_v^{(i)} \mid X_{N_v}^{(i)})$. For $i_v \neq 0$, we have
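\[
\begin{aligned}
p(X_v = i_v \mid X_{N_v} = i_{N_v}) &= p(X_v = i_v \mid X_{V\setminus\{v\}} = i_{V\setminus\{v\}}) = \frac{p(X_V = i_V)}{p(X_{V\setminus\{v\}} = i_{V\setminus\{v\}})} \\
&= \frac{\exp\Big\{\theta_0 + \sum_{j \triangleleft i,\ j \in J_v^{PS}} \theta_j + \sum_{j \triangleleft i,\ S(j) \subseteq N_v} \theta_j + \sum_{j \triangleleft i,\ S(j) \not\subseteq \{v\} \cup N_v} \theta_j\Big\}}{\sum_{k \in I \mid k_{V\setminus\{v\}} = i_{V\setminus\{v\}}} \exp\Big\{\theta_0 + \sum_{j \triangleleft k,\ j \in J_v^{PS}} \theta_j + \sum_{j \triangleleft k,\ S(j) \subseteq N_v} \theta_j + \sum_{j \triangleleft k,\ S(j) \not\subseteq \{v\} \cup N_v} \theta_j\Big\}} \\
&= \frac{\exp\Big\{\sum_{j \triangleleft i,\ j \in J_v^{PS}} \theta_j\Big\}}{1 + \sum_{k \in I \mid k_{V\setminus\{v\}} = i_{V\setminus\{v\}},\ k_v \neq 0} \exp\Big\{\sum_{j \triangleleft k,\ j \in J_v^{PS}} \theta_j\Big\}}
\end{aligned}
\tag{4.1.3}
\]
and
\[
p(X_v = 0 \mid X_{V\setminus\{v\}} = i_{V\setminus\{v\}}) = \frac{1}{1 + \sum_{k \in I \mid k_{V\setminus\{v\}} = i_{V\setminus\{v\}},\ k_v \neq 0} \exp\Big\{\sum_{j \triangleleft k,\ j \in J_v^{PS}} \theta_j\Big\}}. \tag{4.1.4}
\]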
Equality (4.1.3) is due to the fact that the set of $j \in J$ such that $j \triangleleft k$, $S(j) \not\subseteq \{v\} \cup N_v$, is the same whether $k_v = i_v$ or $k_v \neq i_v$, and therefore the term $\exp\{\theta_0 + \sum_{j \triangleleft k,\ S(j) \not\subseteq \{v\} \cup N_v} \theta_j\}$ cancels out in the numerator and the denominator. The same goes for the set of $j \in J$ such that $j \triangleleft k$, $S(j) \subseteq N_v$.
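The cancellation behind (4.1.3) and (4.1.4) can be checked numerically. The sketch below assumes binary variables, so that $j$ is identified with its support $S(j)$ and the sum over $k_v \neq 0$ in the denominator has a single term. The graph and the parameter values are hypothetical. It compares the conditional probability computed as a ratio of joint probabilities with the local formula that uses only the $\theta_j$ with $v \in S(j)$.

```python
import itertools
import math

# Hypothetical toy model: binary variables on the path graph 0 - 1 - 2.
V = [0, 1, 2]
theta = {
    frozenset({0}): 0.3, frozenset({1}): -0.5, frozenset({2}): 0.8,
    frozenset({0, 1}): 1.2, frozenset({1, 2}): -0.7,
}

def support(cell):
    return frozenset(v for v in V if cell[v] != 0)

def log_weight(cell, keep=lambda s: True):
    """Sum of theta_j over supports inside S(cell) that pass `keep` (theta_0 omitted)."""
    s = support(cell)
    return sum(t for sj, t in theta.items() if sj <= s and keep(sj))

cells = list(itertools.product([0, 1], repeat=len(V)))
Z = sum(math.exp(log_weight(c)) for c in cells)  # equals exp(-theta_0)

def joint(cell):
    return math.exp(log_weight(cell)) / Z

def conditional_from_joint(v, cell):
    """p(X_v = i_v | all other variables) as a ratio of joint probabilities."""
    denom = sum(
        joint(c) for c in cells
        if all(c[u] == cell[u] for u in V if u != v)
    )
    return joint(cell) / denom

def conditional_local(v, cell):
    """Same probability from (4.1.3)/(4.1.4), using only theta_j with v in S(j)."""
    one = list(cell)
    one[v] = 1  # binary case: the only cell k with k_v != 0
    w = math.exp(log_weight(tuple(one), keep=lambda s: v in s))
    return w / (1 + w) if cell[v] == 1 else 1 / (1 + w)
```

For every vertex and every cell, `conditional_from_joint` and `conditional_local` return the same value up to rounding, so the terms indexed outside $J_v^{PS}$ play no role in the $v$-th component.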
Remark 4.1.1. In the equation above, we worked with $p(X_v \mid X_{V\setminus\{v\}})$ rather than with $p(X_v \mid X_{N_v})$, though the two are equal; we did this to emphasize that the parameter
\[
\theta^{v,PS} = (\theta_j,\ j \in J_v^{PS}), \quad v \in V, \tag{4.1.5}
\]
of the $v$-th component $L^{v,PS}$ of the conditional composite likelihood is a subvector of $\theta$, the parameter of the global likelihood function.
Besides the pseudo-likelihood, there are other types of conditional composite likelihood methods. Asuncion et al. (2010) proposed a version of composite likelihood that is the conditional likelihood of a subset of random variables given another subset. Increasing the size of the local components makes composite likelihood estimation more accurate, at the cost of higher computational complexity. In our research, we modified the pseudo-likelihood based on this idea and proposed the two-hop conditional composite likelihood.
The two-hop conditional composite likelihood function is
\[
L^{PS2}(\theta) = \prod_{v \in V} L^{v,PS2}(\theta), \qquad L^{v,PS2}(\theta) = \prod_{i=1}^{N} p(X_v^{(i)}, X_{N_v}^{(i)} \mid X_{N_{2v}}^{(i)}). \tag{4.1.6}
\]
The expression of $p(X_v^{(i)}, X_{N_v}^{(i)} \mid X_{N_{2v}}^{(i)})$ is the same as (4.1.3) and (4.1.4) but with $J_v^{PS}$ replaced by $J_v^{PS2}$, where
\[
J_v^{PS2} = \{ j \in J \mid S(j) \subseteq M_v,\ S(j) \not\subseteq N_{2v} \}.
\]
In a parallel way to Remark 4.1.1, we note that
\[
\theta^{v,PS2} = \{ \theta_j,\ j \in J_v^{PS2} \}
\]
is a subvector of $\theta = (\theta_j, j \in J)$, the argument of the global likelihood function.
Let $M_v$ be the one-hop or two-hop neighborhood of $v$. The marginal composite likelihood is the product
\[
L^{M}(\theta) = \prod_{v \in V} \prod_{k=1}^{N} p(X_{M_v}^{(k)}) = \prod_{v \in V} L^{M_v}(\theta), \tag{4.1.7}
\]
where $L^{M_v}(\theta) = \prod_{k=1}^{N} p(X_{M_v}^{(k)})$. The $M_v$-marginal model is clearly multinomial, and the corresponding data can be read from the $M_v$-marginal contingency table obtained from the full table. The density of the $M_v$-marginal multinomial distribution is of the general exponential form
\[
f(t^{M_v}; \theta^{M_v}) = \exp\{ \langle t^{M_v}, \theta^{M_v} \rangle - N k^{M_v}(\theta^{M_v}) \}, \tag{4.1.8}
\]
where $t^{M_v}$, $\theta^{M_v}$ and $k^{M_v}$ are respectively the $M_v$-marginal canonical statistic, canonical parameter and cumulant generating function.
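Reading the $M_v$-marginal contingency table from the full data amounts to projecting each observation onto the coordinates in $M_v$ and counting. A minimal sketch, with a hypothetical sample and a hypothetical choice of $M_v$:

```python
from collections import Counter

def marginal_counts(sample, M):
    """Counts of the M-marginal table: project every observation onto M."""
    M = sorted(M)
    return Counter(tuple(x[v] for v in M) for x in sample)

# Hypothetical sample of N = 6 observations on V = {0, 1, 2, 3}.
sample = [
    (0, 1, 0, 1), (1, 1, 0, 0), (0, 1, 1, 1),
    (1, 0, 0, 1), (0, 1, 0, 0), (1, 1, 1, 1),
]

# Marginal table on M = {0, 1}; cells that never occur have count 0.
table_M = marginal_counts(sample, M={0, 1})
```

The counts in `table_M` sum to $N$, as they must for a multinomial marginal table.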
In order to identify the $M_v$-marginal model, we first establish the relationship between $\theta$ and $\theta^{M_v}$. For the remainder of this thesis, the symbol $j$ is to be understood as an element of $I_{M_v}$ whenever used in the notation $\theta_j^{M_v}$, and as the element of $J$ obtained by padding it with entries $j_{V \setminus M_v} = 0$ whenever used in the notation $\theta_j$. We now give the general relationship between the parameters of the overall model and those of the $M_v$-marginal model. The proof is given in Appendix B.1.
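Lemma 4.1.1. Let $M_v$ be the one-hop or two-hop neighborhood of $v \in V$. For $j \in J$, $S(j) \subset M_v$, the parameter $\theta_j$ of the overall model and the parameter $\theta_j^{M_v}$ of the marginal model are linked by the following:
\[
\theta_j^{M_v} = \theta_j + \sum_{j' \mid j' \triangleleft_0 j} (-1)^{|S(j) - S(j')|} \log\Big( 1 + \sum_{i \in I,\ i_{M_v} = j'} \exp \sum_{k \mid k \triangleleft i,\ k \not\triangleleft j'} \theta_k \Big). \tag{4.1.9}
\]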
We want to identify which of the marginal parameters are equal to the corresponding overall parameter, and in particular which marginal parameters are equal to zero when the global parameter is equal to zero. Let $M_v^c$ denote the complement of $M_v$ in $V$. We define the buffer set at $v$ as follows:
\[
B_v = \{ w \in M_v \mid \exists\, w' \in M_v^c \text{ with } (w, w') \in E \}. \tag{4.1.10}
\]
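Definition (4.1.10) translates directly into a few lines of code. The sketch below assumes that $M_v$ contains $v$ itself together with all vertices within one or two edges of $v$, and it represents the graph by a hypothetical adjacency dictionary `adj`. The function names are illustrative only.

```python
def neighbourhood(adj, v, hops=1):
    """Vertices within `hops` edges of v, including v itself (assumed M_v)."""
    M = {v}
    frontier = {v}
    for _ in range(hops):
        # Expand by one edge and keep only the newly reached vertices.
        frontier = {w for u in frontier for w in adj[u]} - M
        M |= frontier
    return M

def buffer_set(adj, M):
    """B_v of (4.1.10): vertices of M with at least one neighbour outside M."""
    return {w for w in M if any(u not in M for u in adj[w])}

# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

M1 = neighbourhood(adj, 0, hops=1)  # one-hop neighbourhood of vertex 0
B1 = buffer_set(adj, M1)
M2 = neighbourhood(adj, 0, hops=2)  # two-hop neighbourhood of vertex 0
B2 = buffer_set(adj, M2)
```

On this path graph the one-hop neighbourhood of vertex 0 is {0, 1} with buffer set {1}, and the two-hop neighbourhood is {0, 1, 2} with buffer set {2}.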
We have the following result.
Lemma 4.1.2. Let $M_v$ be the one-hop or two-hop neighborhood of $v \in V$. For $j \in J$, $S(j) \subset M_v$, the following holds:
(1.) if $S(j) \not\subset B_v$, then $\theta_j^{M_v} = \theta_j$,
(2.) if $S(j) \subset B_v$, then in general $\theta_j^{M_v} \neq \theta_j$, and (4.1.9) holds.
Moreover, for $i \in I$, $S(i) \subset M_v$,
(3.) if $S(i) \not\subset B_v$, then $\theta_i^{M_v} = 0$ whenever $\theta_i = 0$.
The proof is given in Appendix B.2. From the lemma above, we see that, for $j \in J$ such that $S(j) \subset M_v$ and $S(j) \not\subset B_v$, the corresponding global and $M_v$-marginal log-linear parameters are equal. We see also that, for $i \in I$ such that $S(i) \subset M_v$ and $S(i) \not\subset B_v$, if the log-linear parameter is zero in the global model, it remains zero in the $M_v$-marginal model.
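Statements (1.) and (2.) of Lemma 4.1.2 can be illustrated on a small example. The sketch below is not the proof of Appendix B.2. It assumes binary variables on a hypothetical path graph, takes $M_v = \{0, 1\}$ for $v = 0$ (so that $B_v = \{1\}$), and recovers the marginal log-linear parameters by Möbius inversion of the log marginal cell probabilities, which is assumed here to be the parametrisation in which the zero cell is the baseline.

```python
import itertools
import math

# Hypothetical binary model on the path graph 0 - 1 - 2 - 3.
V = [0, 1, 2, 3]
theta = {
    frozenset({0}): 0.4, frozenset({1}): -0.3,
    frozenset({2}): 0.6, frozenset({3}): -0.2,
    frozenset({0, 1}): 0.9, frozenset({1, 2}): -1.1, frozenset({2, 3}): 0.5,
}
M = [0, 1]  # assumed one-hop neighbourhood of v = 0 (v included)
B = {1}     # buffer set: vertex 1 is adjacent to vertex 2, which lies outside M

def weight(cell):
    """Unnormalised probability of a full cell."""
    s = frozenset(v for v in V if cell[v] == 1)
    return math.exp(sum(t for sj, t in theta.items() if sj <= s))

def log_marginal(support_in_M):
    """Log unnormalised M-marginal probability of the cell with this support."""
    total = sum(
        weight(c) for c in itertools.product([0, 1], repeat=len(V))
        if {v for v in M if c[v] == 1} == set(support_in_M)
    )
    return math.log(total)

def marginal_theta(sj):
    """Moebius inversion over the subsets of S(j); the normaliser cancels."""
    sj = sorted(sj)
    out = 0.0
    for r in range(len(sj) + 1):
        for sub in itertools.combinations(sj, r):
            out += (-1) ** (len(sj) - r) * log_marginal(sub)
    return out

theta_M_0 = marginal_theta({0})      # S(j) = {0} is not inside B
theta_M_01 = marginal_theta({0, 1})  # S(j) = {0, 1} is not inside B
theta_M_1 = marginal_theta({1})      # S(j) = {1} is inside B
```

In this example `theta_M_0` and `theta_M_01` coincide with the global values 0.4 and 0.9, while `theta_M_1` differs from the global value -0.3, in line with cases (1.) and (2.) of the lemma.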