In this section, we first introduce the application of the nonparametric framework to Bayesian models and then extend it to hierarchical Bayesian (HB) models. Assume that there are $N$ observations $D = \{y_1, y_2, \ldots, y_N\}$. Each observation $y_i$ is a continuous two-dimensional random variable whose two dimensions are independent of each other, i.e. their covariance is 0. In Bayesian modeling, we assume that the observations are drawn i.i.d. from a multivariate Gaussian distribution with mean $\mu = (\mu_1, \mu_2)$ and covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}.$$
To simplify the model, we further assume that $\Sigma$ is known but $\mu$ is unknown. As usual, we assume a conjugate prior, i.e. $\mu$ is drawn from a multivariate Gaussian distribution, $\mu \sim N(\mu_{\mathrm{prior}}, \Sigma_{\mathrm{prior}})$. The hyperparameters
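$\mu_{\mathrm{prior}} = (\mu_{\mathrm{prior},1}, \mu_{\mathrm{prior},2})$ and
$$\Sigma_{\mathrm{prior}} = \begin{pmatrix} \sigma_{\mathrm{prior},1}^2 & 0 \\ 0 & \sigma_{\mathrm{prior},2}^2 \end{pmatrix}$$
are the mean and covariance matrix of the prior distribution, respectively. Given the Bayesian model and the observations $D$, the computation of the posterior distribution is straightforward; it is still a Gaussian distribution, with parameters $\mu_{\mathrm{post}} = (\mu_{\mathrm{post},1}, \mu_{\mathrm{post},2})$ and
$$\Sigma_{\mathrm{post}} = \begin{pmatrix} \sigma_{\mathrm{post},1}^2 & 0 \\ 0 & \sigma_{\mathrm{post},2}^2 \end{pmatrix},$$
where
$$\mu_{\mathrm{post},1} = \frac{\dfrac{\mu_{\mathrm{prior},1}}{\sigma_{\mathrm{prior},1}^2} + \dfrac{\sum_{i=1}^{N} y_{i1}}{\sigma_1^2}}{\dfrac{1}{\sigma_{\mathrm{prior},1}^2} + \dfrac{N}{\sigma_1^2}}; \qquad \sigma_{\mathrm{post},1}^2 = \left( \frac{1}{\sigma_{\mathrm{prior},1}^2} + \frac{N}{\sigma_1^2} \right)^{-1},$$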
$\mu_{\mathrm{post},2}$ and $\sigma_{\mathrm{post},2}^2$ are computed in the same way. Figure 4.1(a) shows the known data, which are Gaussian distributed. The prior and posterior distributions of $\mu$ are shown in Figure 4.1(b) and (c).

So far, Bayesian inference has been performed in an ideal situation in which the data really follow a Gaussian distribution, as we assume. In many cases, however, the observations $y_i$ are not exactly Gaussian but follow an arbitrary distribution, e.g. the one shown in Figure 4.1(d), which cannot be approximated by a Gaussian distribution with any parameters. To solve this problem, it is natural to embed the Bayesian model in a nonparametric framework, i.e. to treat the likelihood distribution itself, rather than its parameters, as a random variable. That is, we do not specify the functional form of the likelihood distribution in advance, so what we learn from the data is the probability distribution itself rather than its parameters. Note that the prior in the nonparametric model is not a distribution over a parameter space but a distribution over a set of distributions. Furthermore, the data in the nonparametric model may follow an arbitrary distribution, with no restriction on its scope or type.

Figure 4.2(b) shows the nonparametric model. In contrast to the parametric model shown in Figure 4.2(a), the likelihood is an arbitrary distribution $G$ drawn from $P(G)$, rather than a distribution with a specific mathematical form and unknown parameters. The figure makes clear how the samples are generated in the nonparametric Bayesian model: given a prior $P(G)$, which specifies the probability of the likelihood distribution, a sample distribution $G$ is drawn, and then the samples $y_i$ are drawn i.i.d. from $G$.
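As a concrete illustration of the parametric case described earlier in this section, the conjugate update for the unknown mean can be sketched in a few lines of Python. This is only a minimal sketch under the stated assumptions (known diagonal covariance, independent dimensions); the function names and the toy data are hypothetical and not part of the original model description.

```python
def gaussian_mean_posterior(ys, sigma2, mu_prior, sigma2_prior):
    """Posterior of a Gaussian mean in one dimension, with known variance.

    ys           : observed values in this dimension
    sigma2       : known likelihood variance
    mu_prior     : prior mean
    sigma2_prior : prior variance
    Returns (mu_post, sigma2_post).
    """
    n = len(ys)
    # Posterior precision = prior precision + N * likelihood precision.
    precision_post = 1.0 / sigma2_prior + n / sigma2
    sigma2_post = 1.0 / precision_post
    # Posterior mean = precision-weighted combination of prior mean and data.
    mu_post = sigma2_post * (mu_prior / sigma2_prior + sum(ys) / sigma2)
    return mu_post, sigma2_post


def posterior_2d(data, sigma2, mu_prior, sigma2_prior):
    """Apply the 1-D update separately to each of the two independent dimensions."""
    return [
        gaussian_mean_posterior(
            [y[d] for y in data], sigma2[d], mu_prior[d], sigma2_prior[d]
        )
        for d in range(2)
    ]


# Hypothetical toy data: two observations in two dimensions.
data = [(1.0, -2.0), (3.0, 0.0)]
post = posterior_2d(
    data, sigma2=(1.0, 4.0), mu_prior=(0.0, 0.0), sigma2_prior=(1.0, 4.0)
)
```

Because the covariance matrices are diagonal, the two dimensions never interact, which is why the two-dimensional update reduces to two independent one-dimensional updates.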
We now describe how to apply the nonparametric framework to the hierarchical Bayesian (HB) model. In an HB model, the common prior of the parameters is of central importance, and it is expected to be flexible enough to represent the true situation. In many cases, however, a parametric prior is too restrictive to meet this expectation. We therefore embed hierarchical Bayesian modeling in the nonparametric framework, i.e. the unknown prior $G$ is a sample distribution drawn from a probability model $P(G)$, so that $G$ can take any mathematical form needed to faithfully represent the learned knowledge. Assume that there are $M$ parallel data sets $D = \{D_1, D_2, \ldots, D_M\}$, and that data set $D_j$ contains $N_j$ observations, $D_j = \{y_{j,1}, y_{j,2}, \ldots, y_{j,N_j}\}$. Assume further that the likelihood distributions of the data sets $D_j$ share the same functional form but have distinct parameters $\theta_j$, and that the $\theta_j$ share a common prior. Figure 4.3 shows a parametric HB model and a nonparametric HB model for this example. What distinguishes the nonparametric model from the parametric one is that its prior can be an arbitrary distribution rather than a distribution of a specific form. The generative process of the
Figure 4.2: (a) A parametric Bayesian model for $D = \{y_1, y_2, \ldots, y_N\}$. The observations are drawn i.i.d. from a Gaussian distribution with parameters $\mu$ and $\Sigma$. We assume that $\Sigma$ is known but $\mu$ is unknown and follows a Gaussian distribution with hyperparameters $\mu_{\mathrm{prior}}$ and $\Sigma_{\mathrm{prior}}$. (b) A nonparametric Bayesian model in the same setting. In contrast to the parametric model, the likelihood is an arbitrary distribution $G$ drawn from $P(G)$, rather than a distribution with a specific mathematical form and unknown parameters. (c) A model equivalent to (b).
nonparametric HB model is as follows:
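$$\begin{aligned}
G \mid P(G) &\sim P(G), \\
\theta_j \mid G &\sim G(\theta_j) \quad \text{for } j = 1, \ldots, M, \\
y_{j,i} \mid \theta_j &\sim P(y_{j,i} \mid \theta_j) \quad \text{for } i = 1, \ldots, N_j.
\end{aligned}$$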
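The three sampling steps above can be sketched in Python. The text does not fix a particular $P(G)$ at this point, so the sketch makes several assumptions that are not part of the original: $P(G)$ is taken to be a Dirichlet process (which this section names as a common choice), approximated by a truncated stick-breaking construction; the base distribution is $N(0, 5^2)$; and the likelihood is $N(\theta_j, 1)$. All function names, the truncation level and the data set sizes are hypothetical.

```python
import random


def sample_G(alpha0, base_sampler, truncation, rng):
    """Step 1: draw a random distribution G (assumed: truncated stick-breaking DP).

    Returns the atoms of G and their weights. The weights are renormalised
    because the infinite stick-breaking sum is cut off after `truncation` atoms.
    """
    atoms, weights, remaining = [], [], 1.0
    for _ in range(truncation):
        beta = rng.betavariate(1.0, alpha0)  # fraction broken off the stick
        weights.append(remaining * beta)
        atoms.append(base_sampler(rng))      # atom location drawn from G0
        remaining *= 1.0 - beta
    total = sum(weights)
    return atoms, [w / total for w in weights]


def generate(sizes, alpha0, rng, truncation=50):
    """Run the full generative process for M = len(sizes) data sets."""
    base = lambda r: r.gauss(0.0, 5.0)       # assumed base distribution G0
    atoms, weights = sample_G(alpha0, base, truncation, rng)
    # Step 2: theta_j | G ~ G, one parameter per data set.
    thetas = rng.choices(atoms, weights=weights, k=len(sizes))
    # Step 3: y_{j,i} | theta_j ~ N(theta_j, 1) (assumed likelihood).
    data = [
        [rng.gauss(theta, 1.0) for _ in range(n_j)]
        for theta, n_j in zip(thetas, sizes)
    ]
    return atoms, weights, thetas, data


rng = random.Random(0)
sizes = [3, 5, 2, 4]                         # hypothetical N_j for M = 4 data sets
atoms, weights, thetas, data = generate(sizes, alpha0=2.0, rng=rng)
```

Under this particular assumption $G$ is discrete, so several data sets may receive the same $\theta_j$; other choices of $P(G)$ would behave differently.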
Of central importance in the nonparametric framework are the unknown distribution $G$ and its probability model $P(G)$. $G$ is generally called a random probability distribution (RPD). Ferguson (1973) and Antoniak (1974) stated two desirable properties of $P(G)$. First, it should have large support, i.e. $P(G)$ is expected to cover most of the probability distributions on a given sample space. Second, posterior inference should be computationally manageable, since integration over an infinite-dimensional function space is difficult. Many probabilistic models for $P(G)$ have been developed so far, including the Dirichlet process (DP), the invariant DP, Pólya trees, Bernstein polynomials and the logistic normal process, among which the DP is the one commonly used in statistical machine learning. A Dirichlet process is generally denoted $\mathrm{DP}(\alpha_0, G_0)$, where $\alpha_0$ and $G_0$ are its parameters. The strategy of replacing the parametric prior distribution with a sample from a DP is called Dirichlet enhancement (Escobar & West, 1998); it extends the flexibility of parametric Bayesian modeling by encoding additional uncertainty about the functional form of the prior. As an important result, Dirichlet enhanced models not only represent one's prior knowledge via the parameters of the DP, i.e. $\alpha_0$ and $G_0$, but also allow the prior $G$ (i.e. a sample distribution from the DP) to be as complex as necessary to model the real situation. In the next section, we give more details about the DP.