
Jeffreys’ prior was the first attempt at constructing priors systematically, after about 200 years of reliance on the uniform priors of Bayes and Laplace [44, 45]. Jeffreys’ prior is the one with the following property [45]:

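\[
\pi(\theta) \propto \sqrt{\det I(\theta)} = \sqrt{\det\left[-E\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f(x \mid \theta)\right]\right]_{i,j}} \tag{5.3}
\]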

where $I(\theta)$ is the Fisher information matrix of the parameter $\theta$. The main drawback of Jeffreys’ prior is that in many cases it yields an improper prior, which may fail to produce a proper posterior probability. It also shows some other unexpected behaviors that limit its applicability [103]. Nonetheless, in the discrete scenario of a multinomial model for the likelihood, one obtains a proper prior given by

\[
\pi(\theta_i) \propto \theta_i^{-1/2}, \tag{5.4}
\]

which is equivalent to a Dirichlet distribution with $\alpha_i = 1/2$, $\forall i = 1, \ldots, b$.
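As a quick numerical check of (5.3) and (5.4), consider the simplest case $b = 2$, i.e. a single Bernoulli observation with success probability $\theta$. The sketch below estimates the Fisher information by a finite-difference second derivative and compares $\sqrt{I(\theta)}$ with the Beta(1/2, 1/2) kernel $\theta^{-1/2}(1-\theta)^{-1/2}$. The function names and the step size `h` are illustrative choices, not part of the original method.

```python
import math


def fisher_information_bernoulli(theta, h=1e-4):
    """Fisher information of one Bernoulli(theta) draw, computed as
    -E[d^2/dtheta^2 log f(x|theta)] with a central finite difference."""

    def log_f(x, t):
        return x * math.log(t) + (1 - x) * math.log(1 - t)

    info = 0.0
    # Expectation over the two possible outcomes x = 1 and x = 0.
    for x, prob in ((1, theta), (0, 1 - theta)):
        second = (log_f(x, theta + h) - 2 * log_f(x, theta)
                  + log_f(x, theta - h)) / h ** 2
        info -= prob * second
    return info


def jeffreys_unnormalized(theta):
    """Unnormalized Jeffreys prior: square root of the Fisher information."""
    return math.sqrt(fisher_information_bernoulli(theta))


def beta_half_kernel(theta):
    """Kernel of the Beta(1/2, 1/2) density, i.e. Dirichlet with alpha_i = 1/2."""
    return theta ** -0.5 * (1 - theta) ** -0.5


for theta in (0.1, 0.3, 0.5, 0.8):
    ratio = jeffreys_unnormalized(theta) / beta_half_kernel(theta)
    print(f"theta={theta:.1f}  ratio={ratio:.6f}")
```

The ratio stays constant in $\theta$ up to discretization error, which is what proportionality to the Dirichlet kernel requires.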

Work on prior construction over the last seven decades has concentrated on an optimization view, in which the objective function is chosen to represent some measure of information. In this work, we are mainly interested in the regularization framework, which we introduce in a general form. The notion of regularization goes back to Tikhonov's solution of ill-posed integral equations [31]. Since Tikhonov, an extensive body of work in statistics, signal processing, and machine learning has dealt with regularization [28, 30, 104].

In the general framework, we define a regularized objective-based informative prior probability as the solution to the following optimization problem

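\[
\begin{aligned}
\min_{\pi(\theta)} \quad & -(1 - \lambda^T \mathbf{1})\, E_\theta[g_0(\theta)] + \lambda^T L(\xi) \\
\text{subject to:} \quad & 0_m \preceq \xi, \\
& E_\theta[g_i(\theta)] \le \xi_i; \quad i = 1, \ldots, m, \\
& \int_\Theta \pi(d\theta) = 1
\end{aligned} \tag{5.5}
\]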

in which $E_\theta[g_0(\theta)]$ is an information-measuring term, e.g. the entropy, for which $g_0(\theta) = -\ln \pi(\theta)$ (so that the first term of the objective is a weighted negative entropy), and $L(\xi)$ is a linear function of the slack variables $\xi$. The vector $\xi$ collects the slack variables, which are also optimization variables; i.e., it is an ordered arrangement of all the variables of the forms $\xi_{ij}^{a}$, $\xi_{ij}^{r}$, $\xi_{ijk}^{ca}$, $\xi_{ijk}^{cr}$, and $\xi_{i}^{\mathrm{reg}}$. Define,

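\[
L(\xi) = \left[\, \sum_{i=1}^{|C|} \xi_{i}^{\mathrm{reg}},\;\; \sum_{(i,j)\in G_a} \xi_{ij}^{a} + \sum_{(i,j)\in G_r} \xi_{ij}^{r} + \sum_{(i,j,k)\in G_{ca}} \xi_{ijk}^{ca} + \sum_{(i,j,k)\in G_{cr}} \xi_{ijk}^{cr} \,\right]
\]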

In (5.5), the vector $\lambda = [\lambda_{\mathrm{reg}}, \lambda_{\mathrm{fun}}]$, for which we have $\lambda^T \mathbf{1} \le 1$ and $\lambda \succeq 0$, is the regularization vector (or the design parameter), which depends on the relative importance of the different sources of information. The constraints $E_\theta[g_i(\theta)] \le \xi_i$; $i = 1, \ldots, m$, are extracted from the prior knowledge. In the case of a restricted prior probability family, $\Pi$, parametrized by a vector of hyperparameters $\alpha$, we define

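\[
\begin{aligned}
f_0(\alpha, \xi) &= -(1 - \lambda^T \mathbf{1})\, E_\theta[g_0(\theta) \mid \alpha] + \lambda^T L(\xi), \\
f_i(\alpha, \xi) &= E_\theta[g_i(\theta) \mid \alpha] - \xi_i \le 0; \quad i = 1, \ldots, m.
\end{aligned}
\]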

Then, the optimization problem can be rewritten as follows

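\[
\begin{aligned}
\min_{\pi(\theta \mid \alpha) \in \Pi} \quad & f_0(\alpha, \xi) \\
\text{subject to:} \quad & f_i(\alpha, \xi) \le 0; \quad i = 1, \ldots, m, \\
& 0_m \preceq \xi
\end{aligned} \tag{5.6}
\]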

where $\Pi$ is the feasible region to which the prior distribution belongs. From equation (5.6), one can see that for a parametric prior family, e.g. the Dirichlet distributions, the objective function and constraints reduce to functions of only the parameter vector $\alpha$ and the slack variables. Since the regularization parameters are used to balance the different sources of information, we assume that for each “type” of prior knowledge, the corresponding elements of the vector $\lambda$ are equal. In other words, for all $\xi_{ij}^{a}$, $\xi_{ij}^{r}$, $\xi_{ijk}^{ca}$, and $\xi_{ijk}^{cr}$ we assume a single regularization parameter, denoted by $\lambda_{\mathrm{fun}}$ to emphasize the “functional” nature of this type of information. Similarly, for the regulatory set information we use $\lambda_{\mathrm{reg}}$. Hence, the term $\lambda^T L(\xi)$ in equation (5.6) can be expanded as follows

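\[
\lambda_{\mathrm{reg}} \sum_{i \in C} \xi_{i}^{\mathrm{reg}} + \lambda_{\mathrm{fun}} \left[\, \sum_{(i,j) \in G_a \cup G_r} \left( \xi_{ij}^{a} + \xi_{ij}^{r} \right) + \sum_{(i,j,k) \in G_{ca} \cup G_{cr}} \left( \xi_{ijk}^{ca} + \xi_{ijk}^{cr} \right) \right] \tag{5.7}
\]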
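The expansion in (5.7) amounts to two weighted sums of slack variables. A minimal sketch is given below, assuming each group of slack variables is stored in a dictionary keyed by gene index or index tuple; the function name, the dictionary layout, and the gene labels are hypothetical and serve only as illustration.

```python
def regularization_penalty(lam_reg, lam_fun, xi_reg, xi_a, xi_r, xi_ca, xi_cr):
    """Evaluate lambda^T L(xi) as expanded in (5.7).

    xi_reg maps gene i -> slack; xi_a and xi_r map pairs (i, j) -> slack;
    xi_ca and xi_cr map triples (i, j, k) -> slack (assumed layout).
    """
    if lam_reg < 0 or lam_fun < 0 or lam_reg + lam_fun > 1:
        raise ValueError("lambda must satisfy lambda >= 0 and lambda^T 1 <= 1")
    all_slacks = [xi_reg, xi_a, xi_r, xi_ca, xi_cr]
    if any(v < 0 for group in all_slacks for v in group.values()):
        raise ValueError("slack variables must be non-negative")

    # One shared weight per "type" of prior knowledge.
    reg_term = sum(xi_reg.values())
    fun_term = (sum(xi_a.values()) + sum(xi_r.values())
                + sum(xi_ca.values()) + sum(xi_cr.values()))
    return lam_reg * reg_term + lam_fun * fun_term


penalty = regularization_penalty(
    lam_reg=0.2,
    lam_fun=0.3,
    xi_reg={"g1": 0.5, "g2": 0.25},
    xi_a={("g1", "g2"): 0.1},
    xi_r={("g2", "g3"): 0.2},
    xi_ca={("g1", "g3", "g2"): 0.05},
    xi_cr={},
)
print(penalty)
```

Here the regulatory slacks sum to 0.75 and the functional slacks to 0.35, so the penalty is 0.2 × 0.75 + 0.3 × 0.35.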

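follows

\begin{align}
E_\theta\big[\Pr(x_j = 1 \mid x_i = 1, x_{j,\mathrm{rep}} = 0)\big] &\ge 1 - \xi_{ij}^{a}; && \forall (i, j) \in G_a \tag{5.8a} \\
E_\theta\big[\Pr(x_j = 0 \mid x_i = 1)\big] &\ge 1 - \xi_{ij}^{r}; && \forall (i, j) \in G_r \tag{5.8b} \\
E_\theta\big[\Pr(x_j = 1 \mid x_i = 1, x_k = 0, x_{j,\mathrm{rep}} = 0)\big] &\ge 1 - \xi_{ijk}^{ca}; && \forall (i, j, k) \in G_{ca} \tag{5.8c} \\
E_\theta\big[\Pr(x_j = 0 \mid x_i = 1, x_k = 0)\big] &\ge 1 - \xi_{ijk}^{cr}; && \forall (i, j, k) \in G_{cr} \tag{5.8d} \\
E_\theta\big[H_\theta[x_i \mid R_{x_i}]\big] &\le \xi_{i}^{\mathrm{reg}}; && x_i \in C \tag{5.8e}
\end{align}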
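To make a constraint such as (5.8b) concrete, the following sketch makes a simplifying assumption that is not stated in the text: the joint state of two binary genes $(x_i, x_j)$ has probabilities $\theta = (\theta_{00}, \theta_{01}, \theta_{10}, \theta_{11})$ with a Dirichlet($\alpha$) prior. Under that assumption $\Pr(x_j = 0 \mid x_i = 1) = \theta_{10} / (\theta_{10} + \theta_{11})$ follows a Beta($\alpha_{10}, \alpha_{11}$) distribution, so its prior expectation is $\alpha_{10} / (\alpha_{10} + \alpha_{11})$ and the smallest feasible slack can be read off directly. All names are illustrative.

```python
import random


def expected_conditional(alpha, given, target):
    """E_theta[Pr(x_j = target | x_i = given)] for a Dirichlet prior over
    the joint states (x_i, x_j); alpha maps (x_i, x_j) -> hyperparameter."""
    hit = alpha[(given, target)]
    miss = alpha[(given, 1 - target)]
    return hit / (hit + miss)


def minimal_slack(alpha, given, target):
    """Smallest xi >= 0 with E_theta[Pr(...)] >= 1 - xi, as in (5.8b)."""
    return max(0.0, 1.0 - expected_conditional(alpha, given, target))


def monte_carlo_check(alpha, given, target, n_draws=20000, seed=0):
    """Sanity check of the closed form by sampling Dirichlet draws as
    normalized gamma variates (the normalization cancels in the ratio)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        gammas = {s: rng.gammavariate(a, 1.0) for s, a in alpha.items()}
        hit = gammas[(given, target)]
        miss = gammas[(given, 1 - target)]
        total += hit / (hit + miss)
    return total / n_draws


# Hypothetical hyperparameters encoding "x_i = 1 mostly represses x_j".
alpha = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 4.5, (1, 1): 0.5}
print(expected_conditional(alpha, given=1, target=0))
print(minimal_slack(alpha, given=1, target=0))
print(monte_carlo_check(alpha, given=1, target=0))
```

With these hyperparameters the prior expectation of the conditional probability is 4.5 / 5, so the repression constraint holds for any slack of at least 1 − 4.5 / 5.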

In the following subsections, we consider three constructive methods for selecting prior probabilities compatible with the available prior information. The first two methods were traditionally introduced for constructing least-informative priors. We adopt and modify these methods.

5.2.1 Regularized Maximum-Entropy Priors

The principle of maximum entropy was first stated in statistical mechanics almost 55 years ago by Jaynes [55] as an inference method [105]. It is used to construct the probabilities of the different (random) states in the state space that the system can take, i.e., its microscopic states. In statistical mechanics the state functions are random, due to the randomness of the states, and only some mean values of these state functions can be measured [106]. The maximum-entropy probability is therefore the one whose (information) entropy is maximized subject to these mean values: it leaves us with the greatest uncertainty given the constraints, which prevents adding spurious information. Mathematically speaking, inserting $g_0(\theta) = -\ln \pi(\theta)$ in (5.6) with $\lambda = 0$ (i.e., no slack variables) and

to Appendix D.2 for details). Incorporating the prior knowledge, we extend the notion of maximum-entropy probability to the Bayesian setting with slack variables as in (5.6), where the objective function is given by

\[
f_0(\alpha, \xi) = -(1 - \lambda^T \mathbf{1})\, H[\theta] + \lambda^T L(\xi). \tag{5.9}
\]

As equation (5.9) shows, the objective function of the regularized maximum-entropy prior (RMEP) balances the negative entropy against the knowledge obtained from signaling pathways.
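For a Dirichlet prior family, the entropy in (5.9) has a closed form, so the RMEP objective can be evaluated directly from the hyperparameters. The sketch below is one possible way to do this. It assumes the slack variables have already been summed per type (regulatory and functional), and it implements the digamma function by hand through its recurrence and asymptotic series to stay within the standard library. The function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0: shift x upward with the recurrence
    psi(x) = psi(x + 1) - 1/x, then apply the asymptotic series."""
    if x <= 0:
        raise ValueError("digamma is only implemented for x > 0")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += (math.log(x) - 0.5 / x
               - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))
    return result


def dirichlet_entropy(alpha):
    """Differential entropy H[theta] of a Dirichlet(alpha) distribution."""
    alpha0 = sum(alpha)
    k = len(alpha)
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(alpha0)
    return (log_beta + (alpha0 - k) * digamma(alpha0)
            - sum((a - 1.0) * digamma(a) for a in alpha))


def rmep_objective(alpha, lam_reg, lam_fun, slack_reg_sum, slack_fun_sum):
    """f0 in (5.9): -(1 - lambda^T 1) H[theta] + lambda^T L(xi)."""
    weight = 1.0 - lam_reg - lam_fun
    penalty = lam_reg * slack_reg_sum + lam_fun * slack_fun_sum
    return -weight * dirichlet_entropy(alpha) + penalty


print(dirichlet_entropy([1.0, 1.0]))   # uniform prior on [0, 1]
print(dirichlet_entropy([2.0, 2.0]))
print(rmep_objective([2.0, 2.0], 0.2, 0.3, 0.75, 0.35))
```

With $\lambda = 0$ the objective reduces to the negative entropy alone, which recovers the classical maximum-entropy criterion within the chosen family.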

5.2.2 Regularized Maximal Data Information Priors

The maximal data information prior (MDIP) was introduced by Zellner et al. [107]. Zellner’s choice of objective function is a criterion for constructing a prior probability that remains “maximally committed to the data” [103]. Adapting the original method to the new framework, the MDIP is the one with

\[
g_0(\theta) = -\big[\ln \pi(\theta) + H[p(x \mid \theta)]\big],
\]

in which p(x|θ) is the likelihood of x when it is parameterized by θ. Taking the expectation with respect to θ, we obtain

\[
E_\theta[g_0(\theta)] = H[\theta] - E_\theta\big[H[p(x \mid \theta)]\big].
\]

In the MDIP, “data” does not refer to any actual observation; rather, the data enter through the likelihood and are then marginalized out by taking the entropy (refer to Appendix D.2 for details). Similar to the RMEP, the regularized extension of the MDIP (RMDIP) is the solution to the optimization problem in (5.6) in which

\[
f_0(\alpha, \xi) = -(1 - \lambda^T \mathbf{1}) \Big[ H[\theta] - E_\theta\big[H[p(x \mid \theta)]\big] \Big] + \lambda^T L(\xi). \tag{5.10}
\]

Going from (5.9) to (5.10), one can see that the prior average of the entropy of the data likelihood is subtracted from the entropy of the prior.
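The information term $H[\theta] - E_\theta[H[p(x \mid \theta)]]$ can be evaluated numerically in a small case. The sketch below assumes, for illustration only, a Bernoulli likelihood with success probability $\theta$ and a Beta($a$, $b$) prior with $a, b \ge 1$, and uses a simple midpoint rule on $(0, 1)$. The function names and the grid size are illustrative choices.

```python
import math


def beta_pdf(theta, a, b):
    """Density of the Beta(a, b) prior at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta))


def bernoulli_entropy(theta):
    """Entropy H[p(x|theta)] of a single Bernoulli(theta) observation."""
    return -(theta * math.log(theta) + (1 - theta) * math.log(1 - theta))


def mdip_information(a, b, n_grid=20000):
    """H[theta] - E_theta[H[p(x|theta)]] by midpoint quadrature."""
    width = 1.0 / n_grid
    prior_entropy = 0.0
    expected_likelihood_entropy = 0.0
    for k in range(n_grid):
        theta = (k + 0.5) * width
        density = beta_pdf(theta, a, b)
        if density > 0.0:
            prior_entropy -= density * math.log(density) * width
        expected_likelihood_entropy += density * bernoulli_entropy(theta) * width
    return prior_entropy - expected_likelihood_entropy


print(mdip_information(1.0, 1.0))  # uniform prior
print(mdip_information(2.0, 2.0))
```

For the uniform prior the prior entropy is zero and the expected likelihood entropy is 1/2, so the term equals −1/2; the Beta(2, 2) prior gives a lower value because it places more mass near $\theta = 1/2$, where the Bernoulli entropy is largest.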

5.2.3 Regularized Expected Mean Log-Likelihood Priors

The general framework for the regularized expected mean log-likelihood prior (REMLP) is detailed in Section 4. The main difference between the REMLP and the preceding methods is the way it takes the observations into account [41]. Before the REMLP was introduced, all prior construction methods ignored the observations (measurements) entirely. The REMLP optimization problem, by contrast, searches for priors that are designed to “remain committed to some part of the sample data” through the expected mean log-likelihood function, while satisfying the constraints imposed by the pathways. The expectation of the log-likelihood is taken with respect to the prior in order to marginalize out the dependency of the mean log-likelihood on the actual feature-label distribution parameters and to map it to the hyperparameter space. Henceforth, for notational ease, we drop the index y.

To this end, we first split the given sample, $u$, into two parts, $u_{\mathrm{prior}}$ and $u_{\mathrm{train}}$, with $|u| = |u_{\mathrm{prior}}| + |u_{\mathrm{train}}|$, where the former is used for prior construction. Here, we restate the REMLP in the general regularized framework given in equation (5.6). The REMLP is found by solving the optimization problem in (5.6) when the objective function is given by

\[
f_0(\alpha, \xi) = -(1 - \lambda^T \mathbf{1})\, E_\theta\big[\ell_{n_p}(\theta; u_{\mathrm{prior}})\big] + \lambda^T L(\xi)
\]

where $\ell_{n_p}(\theta; u_{\mathrm{prior}})$ is the log-likelihood function of the sample $u_{\mathrm{prior}}$. In [41] it is shown that $\ell_{n_p}(\theta; u_{\mathrm{prior}})$ can be interpreted as a measure of “similarity” between the true model and the one governed by the parameter $\theta$. This is similar to Akaike’s information criterion for model selection [89].
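The sample split and the expected log-likelihood term can be illustrated in a toy setting. The sketch below rests on several assumptions that are not in the text: the data are binary (Bernoulli with parameter $\theta$), the prior is Beta($a$, $b$) with $a, b \ge 1$, the first $n_p$ points form $u_{\mathrm{prior}}$, and the log-likelihood is averaged per sample point. The expectation over the prior is computed with a midpoint rule. All function names are hypothetical.

```python
import math


def split_sample(u, n_prior):
    """Split u into (u_prior, u_train); the first n_prior points are used
    for prior construction (an assumed splitting rule)."""
    if not 0 < n_prior < len(u):
        raise ValueError("n_prior must leave both parts non-empty")
    return u[:n_prior], u[n_prior:]


def mean_log_likelihood(theta, sample):
    """Average Bernoulli log-likelihood of the sample at parameter theta."""
    ones = sum(sample)
    zeros = len(sample) - ones
    return (ones * math.log(theta) + zeros * math.log(1 - theta)) / len(sample)


def expected_mean_log_likelihood(a, b, sample, n_grid=20000):
    """E_theta[mean log-likelihood] under a Beta(a, b) prior, computed by
    midpoint quadrature over theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    width = 1.0 / n_grid
    total = 0.0
    for k in range(n_grid):
        theta = (k + 0.5) * width
        density = math.exp(log_norm + (a - 1) * math.log(theta)
                           + (b - 1) * math.log(1 - theta))
        total += density * mean_log_likelihood(theta, sample) * width
    return total


u = [1, 1, 0, 1, 0, 1, 1, 0]
u_prior, u_train = split_sample(u, n_prior=4)
for a, b in ((1.0, 1.0), (3.0, 1.0), (1.0, 3.0)):
    print(a, b, expected_mean_log_likelihood(a, b, u_prior))
```

Because $u_{\mathrm{prior}}$ contains mostly ones here, the Beta(3, 1) prior, which favors large $\theta$, attains a higher expected mean log-likelihood than Beta(1, 3), in line with the interpretation of this term as a similarity measure.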