There are a variety of other approaches we have not mentioned, most of which are designed for discrete state-spaces.8

• Truncating small numbers: simply force small numbers in the clique potentials to zero, and then use zero-compression. If done carefully, the overall error introduced by this procedure can be bounded (and computed) [JA90].

• Structural simplifications: make smaller cliques by removing some of the edges from the triangulated graph [Kja94]; again the error can be bounded. A very similar technique is called “mini buckets” [Dec98], but has no formal guarantees.

• Bounded cutset conditioning. Instead of instantiating exponentially many values of the cutset, simply instantiate a few of them [HSC88, Dar95]. The error introduced by this method can sometimes be bounded. Alternatively, we can sample the cutsets jointly, a technique known as blocking Gibbs sampling [JKK95].
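The first item in the list above (truncating small numbers) can be illustrated with a minimal sketch. It assumes a clique potential stored as a dictionary from joint states to non-negative values; the function name `truncate_potential`, the threshold, and the table entries are hypothetical. The reported removed mass is only a crude proxy for the error and is not the bound derived in [JA90].

```python
from itertools import product

def truncate_potential(potential, epsilon):
    """Drop entries below epsilon and keep only the survivors.

    potential: dict mapping joint-state tuples to non-negative floats.
    Returns (compressed, removed_mass). Storing only the surviving
    entries plays the role of zero-compression; removed_mass is the
    total weight that was forced to zero.
    """
    compressed = {s: v for s, v in potential.items() if v >= epsilon}
    removed_mass = sum(v for v in potential.values() if v < epsilon)
    return compressed, removed_mass

# Hypothetical potential over three binary variables (values are made up).
states = list(product([0, 1], repeat=3))
values = [0.40, 0.001, 0.30, 0.002, 0.0005, 0.20, 0.0965, 0.0]
potential = dict(zip(states, values))

compressed, removed_mass = truncate_potential(potential, epsilon=0.01)
total = sum(potential.values())
print(len(potential), "entries ->", len(compressed), "entries")
print("fraction of mass removed:", removed_mass / total)
```

A sparse dictionary is only one possible representation; any structure that skips the zeroed entries would serve the same purpose.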

8 The following web page contains a list of approximate inference methods, c. 1996: camis.stanford.edu/people/pradhan/approx.html

Appendix C

Graphical models: learning

C.1 Introduction

There are many different kinds of learning. We can distinguish between the following “axes”:

• Parameter learning or structure learning. For linear-Gaussian models, these are more or less the same thing (see Section 2.4.2), since 0 weights in a regression matrix correspond to absent directed edges, and 0 weights in a precision matrix correspond to absent undirected edges. For HMMs, “structure learning” usually refers to learning the structure of the transition matrix, i.e., identifying the 0s in the CPD for P(X_t | X_{t-1}). We consider this parameter learning with a sparseness prior, cf. entropic learning [Bra99a]. In general, structure learning refers to learning the graph topology no matter what parameterization is used. Structure learning is often called model selection.

• Fully observed or partially observed. Partially observed refers to the case where the values of some of the nodes in some of the cases are unknown. This may be because some data is missing, or because some nodes are latent/hidden. Learning in the partially observed case is much harder; the likelihood surface is multimodal, so one usually has to settle for a locally optimal solution, obtained using EM or gradient methods.

• Frequentist or Bayesian. A frequentist tries to learn a single best parameter/model. In the case of parameters, this can either be the maximum likelihood (ML) or the maximum a posteriori (MAP) estimate. In the case of structure, it must be a MAP estimate, since the ML estimate would be the fully connected graph. By contrast, a Bayesian tries to learn a distribution over parameters/models. This gives one some idea of confidence in one’s estimate, and allows for predictive techniques such as Bayesian model averaging. Although more elegant, Bayesian solutions are usually more expensive to obtain.

• Directed or undirected model. It is easy to do parameter learning in the fully observed case for directed models (BNs), because the problem decomposes into a set of local problems, one per CPD; in particular, inference is not required. However, parameter learning for undirected models (MRFs), even in the fully observed case, is hard, because the normalizing term Z couples all the parameters together; in particular, inference is required. (Of course, parameter learning in the partially observed case is hard in both models.) Conversely, structure learning in the directed case is harder than in the undirected case, because one needs to worry about avoiding directed cycles, and the fact that many directed graphs may be Markov equivalent, i.e., encode the same conditional independencies.

• Static or dynamic model. Most techniques designed for learning static graphical models also apply to learning dynamic graphical models (DBNs and dynamic chain graphs), but not vice versa. In this chapter, we only talk about general techniques; we reserve discussion of DBN-specific techniques to Chapter 6.

• Offline or online. Offline learning refers to estimating the parameters/structure given a fixed batch of data. Online learning refers to sequentially updating an estimate of the parameters/structure as each data point arrives. (Bayesian methods are naturally suited to online learning.) Note that one can learn a static model online and a dynamic model offline; these are orthogonal issues. If the training set is huge, online learning might be more efficient than offline learning. Often one uses a compromise, and processes “mini batches”, i.e., sets of training cases at a time.

• Discriminative or not. Discriminative training is very useful when the model is going to be used for classification purposes [NJ02]. In this case, it is not so important that each model be able to explain/generate all of the data; it only matters that the “true” model gets higher likelihood than the rival models. Hence it is more important to focus on the differences in the data from each class than to focus on all of the characteristics of the data. Typically discriminative training requires that the models for each class all be trained simultaneously (because of the sum-to-one constraint), which is often intractable. Various approximate techniques have been developed. Note that discriminative training can be applied to parameter and/or structure learning.

• Active or passive. In supervised learning, active learning means choosing which inputs you would like to see output labels for, either by selecting from a pool of examples, or by asking arbitrary questions of an “oracle” (teacher). In unsupervised learning, active learning means choosing where in the sample space the training data is drawn from; usually the learner has some control over where it is in state-space, e.g., in reinforcement learning. The control case is made harder because there is usually some cost involved in moving to unexplored parts of the state space; this gives rise to the exploration-exploitation tradeoff. (The optimal Bayesian solution to this problem, for the case of discrete MDPs, is discussed in [Duf02].) In the context of causal models, active learning means choosing which “perfect interventions” [Pea00, SGS00] to perform. (A perfect intervention corresponds to setting a node to a specific value, and then cutting all incoming links to that node, to stop information flowing upwards.1 A real-world example would be knocking out a gene.)
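The claim above that fully observed parameter learning in a directed model decomposes into one local problem per CPD can be sketched as follows. This is a minimal illustration, assuming discrete (tabular) CPDs; the function `fit_cpd`, the two-node structure A → B, and the data are hypothetical. With `pseudo_count=0` the estimate is the ML one; a positive pseudo-count gives the posterior mean under a symmetric Dirichlet prior, which is one simple way of smoothing the counts.

```python
from collections import Counter, defaultdict

def fit_cpd(data, child, parents, arity, pseudo_count=0.0):
    """Estimate P(child | parents) by counting, using only local columns.

    data: list of dicts mapping variable name -> value in range(arity[var]).
    Returns a dict: tuple of parent values -> list of probabilities over
    the child's values. Only parent configurations seen in the data
    receive an entry.
    """
    counts = defaultdict(Counter)
    for case in data:
        parent_values = tuple(case[p] for p in parents)
        counts[parent_values][case[child]] += 1
    k = arity[child]
    cpd = {}
    for parent_values, c in counts.items():
        total = sum(c.values()) + pseudo_count * k
        cpd[parent_values] = [(c[v] + pseudo_count) / total for v in range(k)]
    return cpd

# Hypothetical structure A -> B with binary nodes and made-up complete data.
structure = {"A": [], "B": ["A"]}
arity = {"A": 2, "B": 2}
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 1)]
data = [{"A": a, "B": b} for a, b in pairs]

# Each CPD is fitted independently of the others; no inference is run.
ml = {v: fit_cpd(data, v, pa, arity) for v, pa in structure.items()}
smoothed = {v: fit_cpd(data, v, pa, arity, pseudo_count=1.0)
            for v, pa in structure.items()}

print("ML       P(B | A=1):", ml["B"][(1,)])
print("smoothed P(B | A=1):", smoothed["B"][(1,)])
```

No analogous per-factor counting works for an undirected model, because the normalizing term Z depends on all the potentials at once.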

In this chapter, we focus on the following subset of the above topics: passive, non-discriminative, offline, static, and directed. That leaves three variables: parameters or structure, full or partial observability, and frequentist or Bayesian. For the cases that we do not focus on, the following are pointers to relevant papers or sections of this thesis.

• Discriminative parameter learning: [Jeb01] discusses maximum entropy discrimination for the exponential family, and reverse Jensen/EM to handle latent variables; [EL01] discuss the TM algorithm for maximizing a conditional likelihood function from fully observed data; [RR01] discuss deterministic annealing applied to discriminative training of HMMs.

• Discriminative structure learning: [Bil98, Bil00] learns the interconnectivity between observed nodes in a DBN for isolated word speech recognition.

• Online parameter learning: see Sections 4.4.2 and C.4.5.

• Online structure learning: [FG97] discuss keeping a pool of candidate BN models, and updating it sequentially.

• Dynamic models: see Chapter 6.

• Undirected parameter learning (using IPF, IIS and GIS, etc.): see e.g., [JP95, Ber, Jor02].

• Undirected structure learning: see e.g., [Edw00, DGJ01].

• Active learning of BN parameters: [TK00].

1 For example, consider the 2 node BN where smoking → yellow-fingers; if we observe yellow fingers, we may assume it is due to nicotine, and infer that the person is a smoker; but if we paint someone’s fingers yellow, we are not licensed to make that inference, and hence must sever the incoming links to the yellow node, to reflect the fact that we forced yellow to true, rather than observed that it was true.

• Active learning of BN structure: [TK01, Mur01a, SJ02].
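Active structure learning relies on the difference between observing a node and intervening on it, as described in the footnote on the smoking and yellow-fingers network. The following minimal sketch makes that difference concrete; all probabilities and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical parameters for the 2-node BN: smoking -> yellow fingers.
p_smoker = 0.3                      # P(S = 1)
p_yellow_given = {1: 0.8, 0: 0.1}   # P(Y = 1 | S = s)

def smoker_given_observed_yellow(p_s, p_y_given_s):
    """P(S = 1 | Y = 1) by Bayes' rule: evidence flows up the edge."""
    joint_s1 = p_s * p_y_given_s[1]
    joint_s0 = (1.0 - p_s) * p_y_given_s[0]
    return joint_s1 / (joint_s1 + joint_s0)

def smoker_given_forced_yellow(p_s):
    """P(S = 1 | do(Y = 1)): the incoming link to Y is cut, so the
    forced value of Y carries no information about S and the prior
    is returned unchanged."""
    return p_s

observed = smoker_given_observed_yellow(p_smoker, p_yellow_given)
intervened = smoker_given_forced_yellow(p_smoker)
print("P(smoker | observe yellow) =", round(observed, 3))
print("P(smoker | paint yellow)   =", intervened)
```

Observing yellow fingers raises the probability of smoking above its prior, whereas painting the fingers yellow leaves it at the prior.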

Note that the case most relevant to an autonomous life-long learning agent is also the hardest: online, active, discriminative, Bayesian structure learning of a dynamic chain-graph model in a partially observed environment. We leave this case to future work.