Sancionar penalmente a las personas jurídicas

2. De la corrupción en general

2.4 De las medidas punitivas contra la corrupción

2.4.2. Sancionar penalmente a las personas jurídicas

itself. It has been verified up to a small error.

On the other hand, that the Fisher information matrix can be approximated by I finds is not well justified. This point is discussed further in section 5.2 (p. 66).

4.4 Implementation of the IID Kernel

In practise, (4.3.4) is more efficiently computed by taking into account that α(z) depends only on z (neither on d nor on w): α(z) and ζ (d, z) can be pre-computed once for all for the entire corpus. γ (w, z) and ζ (q, z) are best computed for each query, as to take advantage of the limited number of different terms present in a query: γ(w, z) is computed only for w ∈ q (i.e. w ∈ V such as n(w, q) > 0). We have observed that processing is an order of magnitude quicker using these precomputations.

Furthermore, the computation of all Fisher kernels can be decomposed into two independent parts K_z andK_w which stem from the contributions of the latent categories and of the terms, respectively; for instance using (4.3.4):

KIID_{(d, q) = K}IID z (d, q) + KwIID(d, q), where KIID z (d, q) = 1 |d| 1 |q| X z∈Z P(d|z)P(q|z)α(z)ζ (d, z)ζ (q, z) , and K_wIID(d, q) = 1 |d| 1 |q| X z∈Z  P(d|z)P(q|z) ·X w∈V n(d, w) P(d, w) n(q, w) P(q, w)γ (w, z)   .

In spite of the optimisation that we present here, computation of the IID kernel remains quite costly, as it yields a double loop on the terms of the queryand the terms of the documents. In this perspective, it is interesting that a kernel such as Hofmann’s would exist, as it gives insights for further approximations that would combine the theoretical foundations and the good results of the IID kernel with the computational advantages of Hofmann’s kernel.

4.5 Conclusions

We developed the true Fisher kernel for PLSI, based on the IID nature of its generative process. The IID Fisher kernel features normalisation by document length and contributions of the Fisher information matrix; in Hofmann’s original kernel, the first was present but not justified beyond an analogy to

tf-idfmodels, while the second was absent.

The resulting kernel outperforms Hofmann’s original kernel (see chapter 7, p. 81). It is much more expensive to compute, but the two kernels can be linked by four simplifying assumptions (table 4.1, p. 60). Examining these assumptions could result in a kernel combining the merits of both IID and Hofmann’s kernel; this is the object of the following chapter.

Chapter 5

The components of the Fisher kernel

for PLSI and their roles

Abstract

The previous chapter has heralded theoretically founded similarities for PLSI by deriving the Fisher kernel that stems directly from the IID nature of the PLSI generative process. In this chapter, we study in detail the contributions of the various terms that compose the kernel. In particular, we study the role of the Fisher information matrix, neglected by Hofmann. We also address the Fisher kernels “word component”K_w and “category components”K_z.

These studies allow us to answer the questions raised by Nyffenegger et al. [61]: what are the consequences of the normalisation of the document log-likelihood by document length, introduced by Hofmann in his legacy kernel? Should the Fisher information matrix be approximated by the identity, as Hofmann does, or not? We also address the so-called “Vector Space” kernel that attempted to bring balance in the Fisher kernel by normalising only itsK_z part.

This study vindicates the validity of the normalisation by document length as introduced by Hofmann, and confirms the necessity to account for the Fisher information matrix.

5.1 Ratios between kernel components

In the preceding chapter, we introduced the “term dimensions”K_wand the “categories dimensions”K_z of the Fisher kernel. The complete Fisher kernel is a sum of these two terms:

K(d, q) = K_z(d, q) + K_w(d, q)

Nyffenegger et al. [61] studied in which proportions these two components contribute to the overall retrieval performance. This question stems from a normalisation by document length in Hof- mann’s legacy kernel [28], justified by analogy with thetf-idfmodel but not strongly theoretically

grounded1_{. Questioning this normalisation led them to propose two alternative variants of the Fisher}

kernel:

1_{In chapter 4, we have shown how regarding PLSI as a stochastic IID process explains the origin of these terms, and justifies}

their introduction in a simplified kernel.

64 CHAPTER 5. COMPONENTS OF PLSI FISHER KERNELS K_z K_w Unnormalised kernel [61] KU z KwU Hofmann [28] 1 P(d)P(q)KzU |C |2 |d ||q|KwU Vector Space [61] KU z |C | 2 |d ||q|KwU

Table 5.1: Components of Hofmann’s kernelKH_{and the Vector Space kernel}_KVS_{with respect to those}

of the kernelKU_{derived from unnormalised log-likelihoods}

1. an un-normalised kernel, denotedKU

2. a hybrid derivative calledVector Space (VS), in which only the second part of the kernel is normalised. This version is not justified beyond experimental considerations.

Nyffenegger et al. go on to compareKH_{and the Vector Space Kernel}_KVS_{— though giving no results}

forKU_.

We take a step further by evaluating KU_{, the kernel where normalisation is removed from both}

components. Furthermore, for all these kernels, we provide experimental results for the components “categories part” and the “term part” alone (denotedKH

z andKwH, and similarly for U and VS).

As mentioned in section 4.2.1, the term that Hofmann uses in his derivation is not exactly the log- likelihood for documentsL_d_b

0(θ) (eq 4.2.2), but rather Ldb0(θ)/|d|, which comprises a normalisation by

document length. His legacy kernel,KH_{, stems from this basis.}

Derivation of the kernel from an un-normalised log-likelihood led to the kernels KU _and _KVS_.

These normalisations of log-likelihoods are echoed more of less explicitly in the kernel componentsK_z andK_w; table 5.1 shows theK_zandK_wcomponents for Hofmann’s kernelKH_{, the Vector Space kernel}

KVS_{and the unnormalised kernel}_KU_{, taking}_KU_{as a reference.}

Given the typical orders of magnitude of |C| (size of the document collection) versus typical |d| and |q|, it quite clear that the VS kernel is largely dominated by its Kwcomponent, which it inherits

fromKH_:

KVS

≃ K_wH (5.1.1)

This was confirmed experimentally, as shown in detail in chapter 7.

When the log-likelihood is multiplied by a constant, the Fisher kernel is left unchanged. However, if the log-likelihood is multiplied by a function ofd not depending on the parameters, i.e. lnew

d (θ) = λ(d)l_d(_{θ), s.t. ∇θ}λ = 0, then Gnew₍_{θ) = E} d (_∇_ρl_dnew(_{ρ)) (∇ρ}l_dnew(ρ))T =E_dλ(d)2(∇ρld(ρ)) (∇ρld(ρ))T , and Knew θ (d, q) = λ(d)λ(q)(∇θld(θ))TGnew(θ)−1(∇θld(θ)),

which cannot, in the most general case, be expressed in terms ofK_θ(d, q).

However, if the contribution of the Fisher information matrixG(θ) is neglected (i.e. if G(θ) ≃ 1), as in Hofmann’s derivation, then the two kernels do come in relation:

Knew

5.1. RATIOS BETWEEN KERNEL COMPONENTS 65 Thus, under the assumption thatG(θ) is approximated by I, we expect to find

KH_{(d, q) =} 1

|d| |q|K

U_{(d, q).}

The reason why this is not verified in table 5.1 stems from slightly different assumptions made in their computations: with assumption 4.3.2 (p. 59), Hofmann states that

b P(w|d)

P(w|d)P(w|z) ≈ 1 ,

where Nyffenegger et al. make a similar but slightly different assumption on thejoint probabilities rather than on theconditional probabilities:

b P(d, w)

P(d, w)P(w|z) ≈ 1. The two relations are linked by

X w b P(w|d) P(w|d)P(w|z) = P(d) |C | |d| X w b P(d, w) P(d, w)P(w|z).

Replacing the first approximation with the second amounts to turning |d|/|C| into P(d), well in line with assumption 4.3.5 (p. 59). Substitution leads directly to the formulae of table 5.1, as expected.

We could thus consider three similar “Hofmann”-like kernels: KH _{as originally computed and}

described above, but also KH1 ₌ |C |2 |d | |q|K

U_{, which has the same} _k

z as KH but a different kw, and

KH2₌ 1

P(d) P(q)KU, which has the samekwasKHbut a differentkz.

We evaluated these three kernels and no change occurred in the results; observed differences in the document similarities were in the order of 10−4_{%. This is because |d|/|C| is indeed a very good}

estimator ofP(d) =P_zP(d|z)P(z).

66 CHAPTER 5. COMPONENTS OF PLSI FISHER KERNELS

In document La aplicación de medidas cautelares en el ejercicio de la acción de extinción de dominio y su eficacia en la extinción de la acción en los delitos de corrupción en El Salvador (página 132-138)