1. Primera Parte
1.2 Naturaleza del Proyecto
The final topic on categorical data modeling to be introduced is the Bayesian network. A
Bayesian network is a graphical decision model consisting of variables, represented by nodes, as
well as the direct dependencies or associations between the variables, which are represented by arcs. It is a directed graph that does not contain cycles. A Bayesian network is used for probabilistic inference, or querying the probabilities of certain variables when the values of other variables are known. For example, one of the main applications of a Bayesian network is determining the most likely cause for a given effect, also known as diagnostic, or bottom-up, reasoning. Top down reasoning can also be performed, in which the probability of effects given causes is computed.(160 161 162, , ) Within diagnostic reasoning, the explanatory variables can be ranked based on their value of information and the degree to which they reduce the uncertainty of the effect.(163 164, ) For the hazmat release model, this was used to identify the variables that should be the top priorities for policy change. In general, the Bayesian network has become a popular means of modeling expert or decision support systems, such as for medical diagnosis or other trouble-shooting applications.
The dependency, or association, structure of a Bayesian network is one of its two major components. Only two-way associations and conditional independencies are depicted in directed graphs. The absence of an arc indicates conditional independence between two variables.
Three-way and higher-order associations are represented only indirectly through multiple arcs. With a directed graph, if two variables are connected by an arc to a third variable, the three- variable interaction is automatically represented in the graph by the connecting arcs.(165) Therefore, even if the exact form of the relationships among the variables is not known, it does not matter because the uncertainty is represented probabilistically.
A typical method of building the structure of a small to moderate sized Bayesian network is manually with the assistance of an expert. Newer methods, which are often applied to larger networks or in the absence of a readily available expert, are machine learning or algorithmic approaches involving inductive inference or search for the most probable structure.(166) Learning modules were implemented in academic Bayesian network software beginning in the early 1990’s.(167) A learning module was just implemented in GeNIe, the decision model software used in this research, in the summer of 2005. An opportunity for future research is a comparison of the results of loglinear modeling with those of learning algorithms for building the structure of the network.
The second major component of a Bayesian network is the quantitative portion, and it represents the joint probability distribution among all the variables. The joint probability distribution is calculated using the conditional probability distribution associated with each node in the network. The conditional probability distribution of a node is the probability that the node takes on each of its possible values given every combination of values of its parent nodes. The joint and conditional probabilities are related according to the chain rule. The chain rule states that for a Bayesian network over the variables U={A1,…,Am}, the joint probability distribution P(U) is the product of all conditional probability distributions specified in the network.
Specifically, , )) ( | ( ) ( =
∏
i i i pa A A P U P Equation 17where pa(Ai) is the parent node set of node Ai. When a variable has no parents, the probability
distribution is the prior distribution.(168) In order to determine the quantitative portion of the Bayesian network, the conditional probability distribution for each variable, or node, must be calculated. The conditional probability distribution for a variable A given its parents B and C is calculated according to Equation 18.(169)
. ) ( ) ( ) , | ( C B P C B A P C B A P I I I = Equation 18
This equation is easily extended to include additional parent variables by including them in the numerator and denominator in the same manner as B and C. To calculate for category combination A=i, B=j, and C=k, the number of records in which A=i and B=j and C=k is divided by the number of records in which B=j and C=k.
) , |
(A B C
P
(170) The conditional probability distribution for A contains i x j x k probabilities, so there is a probability associated with each category combination i,j,k.
Conditional probabilities can be determined based on record counts from a database or subjective data or beliefs from an expert. All probabilities calculated for the Bayesian network in this research were calculated using record counts, or frequency data, from the HMIRS database. Frequency data can be used when dealing with repetitive events that have been recorded. However, a database may not be available, or the event may not be repetitive, for
instance a nuclear war. In these cases, the conditional probabilities must be assessed subjectively by an expert. The subjectivist view considers probability as a measure of personal belief. Hence, Bayesian networks are also known as belief networks.(171)
The foundation of inference in Bayesian networks is Bayes Theorem, which enables inference in any direction in the network. Using Bayes Theorem, some probabilities are updated based on new evidence, or specific values, of other probabilities.(172) Several algorithms exist for performing inference in a Bayesian network. The clustering algorithm, in which the directed graph is converted to a junction tree where the probabilities are then updated, is the fastest known exact algorithm. The clustering algorithm is the default algorithm implemented in
GeNIe, which is discussed next.(173) 3.8.1 Bayesian Network Software
The decision model software used in this research was GeNIe, a graphical decision-theoretic package developed at the Decision Systems Lab at the University of Pittsburgh. GeNIe is a development environment for Bayesian networks and influence diagrams and is available to the community at no cost. Using GeNIe, the modeler builds the network structure using circular nodes and arcs in an intuitive, graphical environment. Conditional probability distributions can be copied into GeNIe for each node, making the construction of the network very efficient. Once this is complete, the various forms of inference discussed previously can be performed.
Decision models can be studied in terms of value of information, which refers to the information value of a parent variable relative to the outcome variable. The information value of a parent variable can also be viewed as its ability to influence or reduce uncertainty in the outcome variable. An entropy-based value function is used in GeNIe to rank the parent variables based on their information content in relation to the outcome variable. This function determines
the decrease in entropy, or uncertainty, by observing a given parent variable. Entropy is a concept from the field of information theory and is used to measure the information value of a variable, which represents the expected amount of information needed to classify a new instance involving the variable.(174)