To estimate PMFs reliably, it is essential to select appropriate moment functions, which supply the PMFs with sufficient cumulative information extracted from the data. A good selection of moment functions not only can significantly improve the accuracy of the estimation but also provides a means to control the computational complexity of the optimization step, with minimal information loss due to coarse discretization of the distribution functions. As an example, estimation of the Gaussian distribution requires only the first moment (mean) and the second central moment (variance); additional moments do not add enough knowledge from the data to yield a substantially different estimated distribution. Table 3.1 lists the minimal set of moment functions needed to characterize a number of well-known continuous distribution functions. However, since the true distribution that gave rise to the observed data is generally unknown, a decomposition technique is used here to approximate the true moment functions underlying the observed samples.
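The Gaussian example can be made concrete with a short worked equation. This is a sketch that assumes the estimated distribution takes the usual exponential (maximum-entropy) form with one multiplier per moment function; the multipliers $\lambda_1$ and $\lambda_2$ are notation introduced here for illustration, not taken from the source.

```latex
% Exponential-form density with moment functions x and x^2:
p(x) \;\propto\; \exp\!\left(\lambda_1 x + \lambda_2 x^2\right), \qquad \lambda_2 < 0 .

% Choosing \lambda_1 = \mu/\sigma^2 and \lambda_2 = -1/(2\sigma^2) and completing the square:
\lambda_1 x + \lambda_2 x^2
  \;=\; -\frac{(x-\mu)^2}{2\sigma^2} + \frac{\mu^2}{2\sigma^2},
\qquad\text{so}\qquad
p(x) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}}\,
           \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
```

Under this assumption, two moment functions already reproduce the Gaussian exactly, which is why higher moments add nothing in that case.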

To simplify the search for appropriate moment functions, it is proposed here to search for the moment functions up to a truncation order $K$, where $K$ is a positive integer to be selected by the user, and

(3.9)

Thus, Eq. (3.7) becomes

(3.10)

With an appropriate truncation order $K$, the appropriate parameter vector of the Taylor series power expansion, and Eq. (3.10), the expectation of the true moment function equals its associated moment. This decomposition means that the user only needs to choose an appropriate value for the positive integer $K$, instead of choosing the appropriate moment functions required by the original formulation of Eq. (3.7). At first glance, a higher $K$ may seem to provide higher accuracy. In general, however, because (a) a higher $K$ imposes higher computational costs and (b) more complex moment functions often fail to predict the actual probability behavior outside the range of the data (due to over-fitting),29 one should use a $K$ that is just adequately large, as suggested by Occam's razor principle.30 Hence, there is a tradeoff between the information loss of lower-order moment functions and the high variance of higher-order ones, particularly when probabilities outside the observed region are to be predicted. In view of this, the lowest level of complexity $K$ that satisfies an error tolerance threshold should be chosen. To find an optimal $K$ systematically, we propose two methods that consider the tradeoff between the bias and the variance of the estimator.

Table 3.1: Some exponential probability distribution functions and their characteristic moments.
Columns: Distribution, Moments, Density Function. Rows: Exponential, Gaussian, Beta, Gamma, Weibull.
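The effect of the truncation order can be illustrated with a small, self-contained sketch. It assumes that the moment functions are the plain powers $x^k$ for $k = 1, \dots, K$, that the support has been rescaled to $[-1, 1]$, and that the PMF has the exponential form fitted by gradient descent on the convex dual. None of this is confirmed by the source as the author's optimization step, and the function names are hypothetical.

```python
import math

def moments(xs, p, order):
    """Raw moments E_p[x**k] for k = 1..order."""
    return [sum(p_i * x ** k for p_i, x in zip(p, xs)) for k in range(1, order + 1)]

def pmf_from_lambdas(xs, lam):
    """PMF with p_i proportional to exp(sum_k lam[k-1] * x_i**k), computed stably."""
    scores = [sum(l * x ** k for k, l in enumerate(lam, start=1)) for x in xs]
    top = max(scores)  # subtract the maximum before exponentiating
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def maxent_pmf(xs, target, steps=5000):
    """Match the target moments by gradient descent on the dual problem."""
    order = len(target)
    step = 1.0 / order  # safe step size because |x**k| <= 1 on the rescaled support
    lam = [0.0] * order
    for _ in range(steps):
        model = moments(xs, pmf_from_lambdas(xs, lam), order)
        # Dual gradient is (model moment - target moment); move against it.
        lam = [l + step * (t - m) for l, t, m in zip(lam, target, model)]
    return pmf_from_lambdas(xs, lam)

# Assumed test case: a discretized Gaussian-shaped PMF with standard deviation 0.3.
xs = [(i - 10) / 10 for i in range(21)]
true_pmf = pmf_from_lambdas(xs, [0.0, -1.0 / (2 * 0.3 ** 2)])

fit_order2 = maxent_pmf(xs, moments(xs, true_pmf, 2))  # mean and second moment
fit_order1 = maxent_pmf(xs, moments(xs, true_pmf, 1))  # mean only
```

In this toy case the order-2 fit recovers the generating PMF, while the order-1 fit, which only sees a zero mean, stays uniform and loses the spread of the data.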

Maximum Likelihood Estimation of the Truncation Orders

The likelihood of a parameter $\theta$ given a database $D$ is defined simply as the conditional probability of $D$ given $\theta$, or

$$L(\theta \mid D) = P(D \mid \theta), \tag{3.11}$$

where $L$ and $P$ are the likelihood function and the conditional probability, respectively. Therefore, to calculate the likelihood function, the conditional probabilities must be available. This definition is the basis of maximum likelihood estimation.27 The likelihood function indicates how well the observed data samples are described by the parameters $\theta$.

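For a given truncation order $K$, the likelihood takes the form

$$L(K \mid D) = \prod_{i} \hat{p}_i(K)^{\,n_i}, \tag{3.12}$$

where $\hat{p}_i(K)$ denotes the model prediction of the probability of state $i$ using the moment functions up to order $K$, and $n_i$ represents the number of data points in the $i$-th state. Note that $\sum_i n_i = N$, the total number of data points.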
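As a minimal sketch, the likelihood of a candidate model can be evaluated in log form from the per-state counts and the predicted state probabilities. The code assumes the usual multinomial form (a product of predicted probabilities raised to the observed counts, without the constant combinatorial factor); the function name and the sample numbers are hypothetical.

```python
import math

def log_likelihood(counts, probs):
    """Sum of n_i * log(p_i) over states, i.e. the log of prod_i p_i**n_i."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    total = 0.0
    for n_i, p_i in zip(counts, probs):
        if n_i == 0:
            continue  # states without observations contribute nothing
        if p_i <= 0.0:
            return float("-inf")  # an observed state was predicted impossible
        total += n_i * math.log(p_i)
    return total

# Illustrative counts for three states and two candidate predictions.
counts = [5, 3, 2]
ll_uniform = log_likelihood(counts, [1 / 3, 1 / 3, 1 / 3])
ll_matched = log_likelihood(counts, [0.5, 0.3, 0.2])
```

Working with the logarithm avoids numerical underflow when the number of data points is large, and it leaves the ranking of candidate orders unchanged.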

Similar to the behavior observed for the mean square error and the bias-variance tradeoff,31 as the complexity level of a model increases, the model fits the data better and the likelihood function increases. These trends continue up to a certain complexity level, beyond which they reverse: past this level of complexity (here, $K$), the likelihood of the data (as a measure of the accuracy of the model) decreases while the computational cost increases. The value of $K$ that yields the best fit is called the maximum likelihood estimate (MLE) of the parameter $K$:

$$\hat{K}_{\mathrm{MLE}} = \arg\max_{K} L(K \mid D), \tag{3.13}$$

which agrees with Occam's razor principle. However, since this maximum occurs at high orders of $K$, with no significant increase over a wide range of lower values of $K$, the user may decide to select a lower order that satisfies some minimal goodness-of-fit criterion while keeping the computations more tractable.
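The selection rule described here can be sketched as follows: take the order with the highest log-likelihood, or, with a tolerance, the smallest order whose log-likelihood is close enough to that maximum. The tolerance-based rule is one possible reading of the "minimal goodness-of-fit criterion", and the log-likelihood values below are made up for illustration.

```python
def select_order(loglik_by_order, tolerance=0.0):
    """Smallest order whose log-likelihood is within `tolerance` of the maximum."""
    if not loglik_by_order:
        raise ValueError("need at least one candidate order")
    best = max(loglik_by_order.values())
    return min(k for k, ll in loglik_by_order.items() if ll >= best - tolerance)

# Hypothetical log-likelihoods per truncation order (not results from the source).
loglik = {1: -120.0, 2: -101.5, 3: -100.9, 4: -100.7, 5: -100.6, 6: -101.2}

mle_order = select_order(loglik)                    # plain maximum likelihood
frugal_order = select_order(loglik, tolerance=1.0)  # lowest order on the plateau
```

With these numbers the maximum sits at order 5, but order 2 is already within one log-likelihood unit of it, so the tolerant rule picks the cheaper model.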

Maximum a Posteriori Estimation of the Truncation Orders

If Bayes' rule is used to relate the likelihood and the a priori probability over the model's complexity, one can set up a framework to incrementally update the belief about the complexity level. Unlike the MLE, which yields a point estimate, Bayesian model selection provides a distribution for the complexity level; that is, confidence intervals for the parameter can be derived in addition to other statistical characteristics. Using Bayesian model averaging,32 we obtain

(3.14)

in which $K_{\max}$ stands for the maximum truncation order when equal orders are used for all truncations. Eq. (3.14) averages over the different complexity levels to derive a distribution. However, it is often not possible to calculate this sum. As a general solution, we approximate it by

(3.15)

where

(3.16)

The right-hand side of Eq. (3.16) is based on Bayesian belief updating. The parameter $K$ is also known as the complexity-controlling parameter, and $P(K)$ denotes its prior probability. Since the likelihood function, as stated by Eq. (3.12), does not have a closed form in general, setting up a conjugate prior for it is not possible. However, we can still assign an informative prior, for example a normal distribution with zero mean and some positive number as the variance. As more information is incorporated through the likelihood term, the updated belief about $K$ approaches its true value. If the mode of this posterior is used as the point estimate of $K$, the estimate is called the maximum a posteriori (MAP) estimate. Eq. (3.16) implies that if a uniform distribution is used as the prior, the MLE and MAP estimates coincide. In the next section we apply these concepts and algorithms to an example Bayesian network.
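The relationship between the MLE and MAP estimates can be sketched numerically. The code assumes the standard Bayesian update in which the posterior over the truncation order is proportional to the likelihood times the prior; the zero-mean normal prior is evaluated at the integer orders and left unnormalized, its variance is an arbitrary choice, and the log-likelihood values are made up for illustration.

```python
import math

def posterior_over_orders(loglik_by_order, log_prior_by_order):
    """Normalized posterior over orders, proportional to likelihood times prior."""
    orders = sorted(loglik_by_order)
    log_post = [loglik_by_order[k] + log_prior_by_order[k] for k in orders]
    shift = max(log_post)  # log-sum-exp shift to avoid underflow
    weights = [math.exp(v - shift) for v in log_post]
    total = sum(weights)
    return {k: w / total for k, w in zip(orders, weights)}

def map_order(posterior):
    """Mode of the posterior, i.e. the MAP estimate of the order."""
    return max(posterior, key=posterior.get)

# Hypothetical log-likelihoods per truncation order (not results from the source).
loglik = {1: -120.0, 2: -101.5, 3: -100.9, 4: -100.7, 5: -100.6, 6: -101.2}

# Uniform prior: every order equally likely a priori.
log_uniform = {k: -math.log(len(loglik)) for k in loglik}
# Zero-mean normal prior with an assumed variance of 2 (unnormalized log density).
variance = 2.0
log_normal = {k: -k * k / (2.0 * variance) for k in loglik}

post_uniform = posterior_over_orders(loglik, log_uniform)
post_normal = posterior_over_orders(loglik, log_normal)
```

Under the uniform prior the MAP order equals the maximum likelihood order (5 with these numbers), whereas the zero-mean normal prior penalizes high orders and moves the MAP estimate down to order 2.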
