
Graphical models have been a very popular tool over the last decade [165], [53], [140], [131]. They make it possible to handle complicated dependencies between the variables of a multivariate distribution using a graph representation. Nodes of the graph represent random variables, and edges between nodes represent dependencies between the variables. Exploiting independence between variables allows a compact representation, and the graph provides a modeling language for incorporating those independencies. Graphical models are a wide research topic in their own right; here we give only a very brief introduction to the notions used in this thesis. For a more in-depth introduction, see [131], [140].

In a graphical model, some nodes correspond to observed variables and others denote latent (hidden or missing) variables. There may also be nodes representing parameters ($\Theta$) and hyper-parameters^16. As a convention in this thesis, we represent observed variables with gray circles and latent ones with white circles (see Figures 2.8a, 2.8c, and 2.8d for examples). The two most common classes of graphical models are Bayesian networks (BN), which are based on directed acyclic graphs (DAG), and Markov networks, which are based on undirected graphs^17.

Let us assume that we have defined a Bayesian network with a DAG ($G$) on $D$ variables, $[x_1, \dots, x_D] = \mathbf{x} \in \mathcal{X} \subset \mathbb{R}^D$. The distribution over all variables in the BN can be factorized as a product

$$P(\mathbf{x}) = P(x_1, \dots, x_D) = \prod_{i=1}^{D} P(x_i \mid \pi(x_i)) \qquad (2.4.1)$$

where $P(x_i \mid \pi(x_i))$ is the conditional probability of $x_i$ given its parent nodes $\pi(x_i)$^18. Examples of BNs are shown in Figures 2.8a and 2.8c.

16 Hyper-parameters describe distributions over parameters.
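As a minimal sketch of the factorization in Eq. 2.4.1, the following evaluates the joint probability of a small Bayesian network as a product of conditional probabilities. The chain structure `x1 -> x2 -> x3`, the binary domains, and all table values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical chain-structured Bayesian network x1 -> x2 -> x3 (binary variables).
parents = {"x1": (), "x2": ("x1",), "x3": ("x2",)}

# Conditional probability tables: cpt[var][parent_values][value] = P(var = value | parents).
# All numbers are illustrative assumptions.
cpt = {
    "x1": {(): {0: 0.6, 1: 0.4}},
    "x2": {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    "x3": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.5, 1: 0.5}},
}


def joint(assignment):
    """Return P(x) as the product of P(x_i | parents(x_i)), following Eq. 2.4.1."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p


variables = list(parents)
# Summing the joint over every assignment should give 1 for a valid network.
total = sum(
    joint(dict(zip(variables, values)))
    for values in product([0, 1], repeat=len(variables))
)

print(joint({"x1": 1, "x2": 1, "x3": 0}))
print(total)
```

Each factor only needs the values of a node and its parents, which is what makes the representation compact: the three tables above hold 10 numbers, whereas a full joint table over more variables grows exponentially.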

For a Markov network (i.e., edges are undirected and cycles are allowed), the distribution can be factorized as a product of non-negative potential functions:

$$P(\mathbf{x}) = P(x_1, \dots, x_D) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi(x_C), \qquad Z = \int_{\mathcal{X}} \prod_{C \in \mathcal{C}} \psi(x_C)\, d\mathbf{x} \qquad (2.4.2)$$

where $\mathcal{C}$ is the set of maximal cliques (the largest fully connected subgraphs), $Z$ is a normalizer that produces a proper distribution, and $x_C$ is the set of all random variables in a clique $C$ (see Figure 2.8b for an example).

In order to apply a graphical model, one needs to know how to perform Learning and Inference over the graph. A comprehensive survey of learning and inference algorithms is beyond the scope of this chapter; here we provide a very brief explanation of each.
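Returning to the Markov network factorization in Eq. 2.4.2, the sketch below computes the normalizer $Z$ and the resulting distribution for a tiny undirected chain. The graph `x1 - x2 - x3`, the binary domains, and the potential `psi` are hypothetical; because the variables are discrete here, the integral over $\mathcal{X}$ becomes a sum over all assignments.

```python
from itertools import product

# Hypothetical undirected chain x1 - x2 - x3; its maximal cliques are the two edges.
variables = ["x1", "x2", "x3"]
cliques = [("x1", "x2"), ("x2", "x3")]


def psi(a, b):
    """Illustrative non-negative potential that favors equal neighbors."""
    return 2.0 if a == b else 1.0


def unnormalized(assignment):
    """Product of clique potentials, i.e. the numerator of Eq. 2.4.2."""
    score = 1.0
    for clique in cliques:
        score *= psi(*(assignment[v] for v in clique))
    return score


# Enumerate all assignments (feasible only for very small models).
states = [dict(zip(variables, values)) for values in product([0, 1], repeat=3)]

# The normalizer Z: a sum replaces the integral for discrete variables.
Z = sum(unnormalized(s) for s in states)


def prob(assignment):
    """P(x) = (1 / Z) * product of clique potentials."""
    return unnormalized(assignment) / Z


print(Z)
print(prob({"x1": 0, "x2": 0, "x3": 0}))
```

Unlike the conditional tables of a Bayesian network, the potentials are not probabilities on their own; only after dividing by $Z$ does the product become a proper distribution, and computing $Z$ by enumeration is exponential in the number of variables.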

Inference is about computing queries from the model. Both directed and undirected models represent the full joint probability of all variables; however, one often wants to answer a specific query. The most common queries are the conditional probability query and the most probable query. In a conditional probability query, we have observations over a subset of the random variables and would like to compute the conditional probability over another set of variables, namely $P(x_{C_1} \mid x_{C_2} = z)$, where $C_2$ is the set of observed variables, $z$ is the observed value, and $C_1$ is the set of variables we are interested in. In a “most probable query”, we are interested in finding the most probable value given an observation. An obvious example of such a query is maximum a posteriori (MAP) estimation, mentioned earlier in Section 2.4.1, namely $\arg\max_{x_{C_1}} P(x_{C_1} \mid x_{C_2} = z)$.

Figure 2.8: A few examples of graphical models. Panels (a), (c), and (d) are examples of Bayesian networks constructed with a directed acyclic graph (DAG); more specifically, (a) represents a Hidden Markov Model (HMM) [169]. Panels (c) and (d) are equivalent, with (d) being the more compact representation; the box in (c) denotes repetition of $N$ variables. Panel (b) represents an example of a Markov network constructed with an undirected graph. All gray nodes ($y_i$'s) are observed variables and the white nodes are the latent variables.

Exact inference on a general graph is intractable for a large number of models; for this reason, we resort to approximations. In general, there are two frameworks for approximate inference: optimization-based and sampling-based. In the optimization-based approach, a class of “easy” distributions is defined, and the objective of the optimization is to find the member of that class that best approximates the query. KL-divergence^19 is usually used to measure the distance between distributions. In sampling-based algorithms, the joint distribution is approximated by a set of instantiations of all or some of the variables in the graph. The instantiations (i.e., samples) represent part of the probability mass. The query can usually be expressed as an expectation, and the approximation is done by generating $M$ samples^20 and computing the empirical expectation (see [131] for a more in-depth discussion).

19 Recall that the relative entropy between $P_1$ and $P_2$ is defined as $D(P_1 \| P_2) = E_{P_1}\left[\ln \frac{P_1(x)}{P_2(x)}\right]$.

20 For example, Markov chain Monte Carlo (MCMC) is an approach for generating samples from the posterior distribution.
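The two query types and the sampling-based approximation can be illustrated on a toy model. The sketch below assumes a hypothetical binary chain `x1 -> x2 -> x3` with made-up tables; it answers the conditional query $P(x_1 \mid x_3 = z)$ exactly by enumeration, derives the MAP value from it, and then approximates the same query with an empirical expectation over samples. For simplicity the samples are drawn by forward (ancestral) sampling with rejection of samples that disagree with the observation, not by MCMC. The KL-divergence helper compares the exact and approximate answers.

```python
import math
import random

# Hypothetical chain x1 -> x2 -> x3 over binary variables; all numbers are illustrative.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}


def joint(x1, x2, x3):
    """Joint probability from the directed factorization."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]


def conditional_x1_given_x3(z):
    """Exact P(x1 | x3 = z): sum out x2, then normalize by the evidence."""
    unnorm = {x1: sum(joint(x1, x2, z) for x2 in (0, 1)) for x1 in (0, 1)}
    evidence = sum(unnorm.values())
    return {x1: p / evidence for x1, p in unnorm.items()}


def map_x1_given_x3(z):
    """Most probable value of x1 given x3 = z."""
    posterior = conditional_x1_given_x3(z)
    return max(posterior, key=posterior.get)


def sample_once(rng):
    """Draw one joint sample by sampling each node after its parent."""
    x1 = int(rng.random() < p_x1[1])
    x2 = int(rng.random() < p_x2_given_x1[x1][1])
    x3 = int(rng.random() < p_x3_given_x2[x2][1])
    return x1, x2, x3


def sampled_x1_given_x3(z, num_samples, seed=0):
    """Estimate P(x1 = 1 | x3 = z) as an empirical mean over samples with x3 == z."""
    rng = random.Random(seed)
    draws = (sample_once(rng) for _ in range(num_samples))
    kept = [x1 for x1, _, x3 in draws if x3 == z]
    return sum(kept) / len(kept)


def kl_divergence(p1, p2):
    """D(P1 || P2) = E_{P1}[ln(P1(x) / P2(x))] for discrete distributions."""
    return sum(p1[k] * math.log(p1[k] / p2[k]) for k in p1 if p1[k] > 0)


exact = conditional_x1_given_x3(1)
estimate = sampled_x1_given_x3(1, num_samples=20000)
approx = {0: 1.0 - estimate, 1: estimate}

print(exact, map_x1_given_x3(1))
print(approx, kl_divergence(exact, approx))
```

Enumeration is only feasible because the model has three binary variables; the sampling estimate trades exactness for a cost that depends on the number of samples $M$, although rejection becomes wasteful when the observation is unlikely.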

Learning in graphical models includes two aspects: parameter estimation and structure learning. In parameter estimation, it is assumed that the general structure of the graph (i.e., the dependencies between variables) is given, and the task is to find the parameters given training data $Z$. In structure learning, the objective is to extract both the structure and the parameters of a Bayesian network or Markov network from the training data. In this thesis, whenever we use a graphical model, the structure is given and justified through a few arguments; see [131] for a discussion of structure learning in graphical models. Parameter estimation can be done with maximum likelihood estimation (MLE) or Bayesian approaches. The difference between the two is that the Bayesian approach assumes a prior distribution over the parameters to improve robustness against over-fitting. Nevertheless, the key ingredient for both is the likelihood function: the probability of the data given the model. Assuming that there are $m$ independent training samples, MLE maximizes $J(\Theta; Z) = \prod_{i=1}^{m} P(z_i \mid \Theta)$, and the Bayesian objective is to maximize $P(\Theta) \prod_{i=1}^{m} P(z_i \mid \Theta)$. The factorizations in Eq. 2.4.1 and Eq. 2.4.2 can now be exploited to decompose $P(z_i \mid \Theta)$ further. While parameter estimation in a BN can be solved efficiently thanks to the decomposability of parent and child random variables in Eq. 2.4.1, parameter estimation in a Markov network usually involves iterative inference and local parameter estimation; it is therefore more expensive than parameter estimation in a BN (see [131] for more details).
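To illustrate why parameter estimation in a BN decomposes, the sketch below estimates one conditional table $P(x_2 \mid x_1)$ for a hypothetical two-node network `x1 -> x2` from fully observed data. Because the likelihood factorizes as in Eq. 2.4.1, each table can be fitted independently by counting. The data set is made up, and the symmetric Dirichlet prior used for the Bayesian variant is an assumption of this example, not something specified in the text.

```python
from collections import Counter

# Hypothetical fully observed training data; each sample is (x1, x2).
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0), (0, 0)]


def estimate_cpt(samples, alpha=1.0, num_values=2):
    """Estimate P(x2 | x1) by counting.

    alpha = 1 gives the maximum likelihood estimate. alpha > 1 gives the
    maximizer of P(Theta) * likelihood under an assumed symmetric
    Dirichlet(alpha) prior on each row of the table.
    """
    pair_counts = Counter(samples)
    parent_counts = Counter(x1 for x1, _ in samples)
    cpt = {}
    for x1 in range(num_values):
        denom = parent_counts[x1] + num_values * (alpha - 1.0)
        cpt[x1] = {
            x2: (pair_counts[(x1, x2)] + alpha - 1.0) / denom
            for x2 in range(num_values)
        }
    return cpt


mle = estimate_cpt(data)               # pure relative frequencies
bayes = estimate_cpt(data, alpha=2.0)  # pulled toward uniform by the prior

print(mle)
print(bayes)
```

With only four samples per parent value, the prior visibly moves the estimate from 0.75 toward 0.5, which is the robustness against over-fitting mentioned above. No comparable closed form exists for a general Markov network, because the normalizer $Z$ couples all the potentials.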
