

Parameter estimation is a standard problem in data analysis. Given a set of measurements, model parameters are estimated so as to best explain the data (a learning process). Commonly, either the least-squares algorithm is used to estimate the parameters, or the maximum likelihood (ML) method is used to estimate the parameters and their uncertainties (Dose, 2003).

Bayesian analysis and the ML method both treat parameter estimation in a probabilistic framework. BPT, like the ML method, provides estimates of the parameters and their uncertainties. The main difference between the Bayesian and the ML approaches to parameter estimation is that BPT makes probability statements about the parameters, while classical statistics cannot. In fact, in classical statistics parameters are not allowed to be random variables (O’Hagan, 2000).

In an ML estimation approach, one computes the mode (i.e. the maximum) of the likelihood function, which is a pdf of the data given some parameters. Often, the Conjugate Gradient optimization technique is used to maximize the likelihood. The ML solution maximizes the probability of the data. However, only a single point in parameter space is found, and there is no guarantee that it is unique: a local maximum may be found instead of the global one. The local curvature of the likelihood function at the ML solution is used to construct error bars (confidence intervals). Hypothesis testing then relies on likelihood-ratio statistics. The strengths of ML estimation rest on its large-sample properties: when the sample size is sufficiently large, one can assume both that the test statistic is normally distributed about its mean and that the likelihood-ratio tests follow χ² distributions. These convenient properties do not necessarily hold for small samples (see, e.g., Kendall and Stuart, 1979; Eadie et al., 1982; Loredo, 2004 for more details).
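The ML recipe described above (maximize the likelihood, then read an error bar off the local curvature) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any cited author: the exponential lifetime model, the data values, and the helper names `log_likelihood`, `maximize` and `curvature_error` are all hypothetical, and a golden-section search stands in for the Conjugate Gradient optimizer mentioned in the text because the problem here is one-dimensional.

```python
import math

def log_likelihood(tau, data):
    """Log-likelihood of i.i.d. exponential data with mean lifetime tau."""
    n = len(data)
    return -n * math.log(tau) - sum(data) / tau

def maximize(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return 0.5 * (a + b)

def curvature_error(f, x, h=1e-4):
    """One-sigma error bar from the local curvature: sigma = (-f''(x)) ** -0.5."""
    second = (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2
    return 1.0 / math.sqrt(-second)

# Hypothetical measurements (assumption, for illustration only).
data = [0.8, 2.1, 0.3, 1.7, 1.1, 0.5, 3.2, 0.9]

f = lambda tau: log_likelihood(tau, data)
tau_ml = maximize(f, 0.01, 20.0)   # point estimate in parameter space
sigma = curvature_error(f, tau_ml)  # error bar from the curvature at the maximum

print(f"ML estimate: {tau_ml:.4f} +/- {sigma:.4f}")
```

For this particular model the ML estimate coincides with the sample mean and the curvature-based error bar with the sample mean divided by the square root of the sample size, which gives a simple check on the numerical result. With only eight data points, the Gaussian approximation behind that error bar is exactly the kind of large-sample assumption the text warns about.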

Bayes’ theorem, instead, combines initial knowledge about the distribution of the unknown parameters entering the model with the likelihood pdf of the data given the parameters. The strengths of the Bayesian procedure in parameter estimation are due to the use not only of prior knowledge, but also of marginalization (described in Section 2.1.3); see Loredo (2004) for more details. The Bayesian solution to parameter estimation is the full posterior pdf of the parameters and not just a single point in parameter space. Hence, BPT allows one to obtain a predictive distribution of the parameters. The values of the parameters and their uncertainties are derived from their joint posterior pdf. Probability contours (credible regions) in parameter space describe the uncertainties of the parameters. A credible region R of probability p is the region of highest posterior density that contains a fraction p of the posterior probability:

\[
\int_R d\theta \, p(\theta \mid D, I) = p, \tag{2.7}
\]

where θ is a set of parameters (Loredo and Lamb, 2002). Credible regions are more robust than confidence intervals in classical statistics (Connors, 1997). In fact, with BPT there is no need to employ sample data drawn from a population to derive statements about the parameters.
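For a single parameter, a credible region in the sense of eq. (2.7) can be approximated on a grid: rank the grid cells by posterior density and collect them, highest first, until their summed probability reaches p. The sketch below assumes a uniform grid; the function name `hpd_region` and the gamma-shaped example posterior are hypothetical choices made only for illustration.

```python
import math

def hpd_region(grid, density, p):
    """Grid points in the highest-posterior-density region of probability p.

    Assumes a uniform grid; `density` may be unnormalized.
    """
    norm = sum(density)
    # Visit cells from highest to lowest posterior density.
    order = sorted(range(len(grid)), key=lambda i: density[i], reverse=True)
    mass, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        mass += density[i] / norm
        if mass >= p:
            break
    return sorted(grid[i] for i in chosen)

# Hypothetical unnormalized posterior: theta^2 * exp(-theta), with mode at theta = 2.
grid = [0.005 + 0.01 * k for k in range(3000)]
density = [t**2 * math.exp(-t) for t in grid]

region = hpd_region(grid, density, 0.68)
lo, hi = region[0], region[-1]
print(f"68% credible interval: [{lo:.2f}, {hi:.2f}]")
```

Because this posterior is unimodal, the region is a single interval and its two end points have (nearly) equal posterior density. For a multimodal posterior the same function would return several disjoint pieces, so reporting only the first and last point would no longer be adequate.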

BPT also provides a valid approach to parameter estimation for moderate and small data sets. A peculiarity of the Bayesian approach to parameter estimation is that the accuracy estimates of the parameters depend on the estimated noise. Everything that probability theory cannot fit with the model is assigned to the noise. Large uncertainties are assigned to the model parameters when the noise is estimated to be large (see Bretthorst, 1988 for more details).
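The link between estimated noise and parameter uncertainty can be made concrete with a textbook case that is not taken from the source. Assume Gaussian noise of unknown standard deviation, a flat prior on the mean and a Jeffreys prior (proportional to 1/σ) on the noise level. Marginalizing over σ then gives a Student-t posterior for the mean, centred on the sample mean with scale s/√N, where s is the sample standard deviation. The function name and both data sets below are hypothetical.

```python
import math

def mean_posterior_scale(data):
    """Location and scale of the Student-t marginal posterior of the mean.

    Assumes Gaussian noise with unknown sigma, a flat prior on the mean
    and a Jeffreys prior on sigma (illustrative assumptions).
    """
    n = len(data)
    mean = sum(data) / n
    s2 = sum((x - mean) ** 2 for x in data) / (n - 1)  # estimated noise variance
    return mean, math.sqrt(s2 / n)

# Two hypothetical data sets with the same mean but different scatter.
quiet = [9.9, 10.1, 10.0, 9.8, 10.2]
noisy = [8.0, 12.0, 10.0, 6.0, 14.0]

mean_q, scale_q = mean_posterior_scale(quiet)
mean_n, scale_n = mean_posterior_scale(noisy)
print(f"quiet data: mean = {mean_q:.2f}, posterior scale = {scale_q:.3f}")
print(f"noisy data: mean = {mean_n:.2f}, posterior scale = {scale_n:.3f}")
```

Both data sets yield the same point estimate, but the scatter that the constant-mean model cannot explain is attributed to the noise, and the posterior of the mean widens accordingly.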

Note that a parameter contained within a model for the prior distribution of multiple parameters, which are themselves directly included in a model describing the data, is called a hyperparameter.

Prior information

The physical situation always supplies proper information (Fischer and Dose, 2002). Within BPT, every piece of relevant information entering the models is stated explicitly. Priors are necessary to perform the 'probability inversion' of eq. (2.4). Priors account for the geometry of the hypothesis space, converting the likelihood from 'intensity' to 'measure' (Loredo, 2009). Prior information, encoded in probability distributions, helps to improve the estimates of the parameters (Bretthorst, 1988).

To formulate a distribution from a given state of a priori knowledge, the principle of maximum information entropy is used (Dose, 2003). The maximum entropy (MaxEnt) principle assigns probabilities on the basis of incomplete or uncertain information by maximizing the uncertainty in the probability distribution (Gregory, 2005). With the MaxEnt principle, constraint (or testable) information is combined with Shannon’s entropy measure of the uncertainty of a probability distribution to arrive at a unique probability distribution (Shannon, 1948; Jaynes, 1968, 2003). Maximizing the entropy yields the probability distribution that is most conservative and noncommittal while agreeing with the available information. One example is prior information that constrains only a mean value. The distribution with maximum entropy, subject to a given average value, is an exponential function: let θ be a parameter and θ̂ a point estimate (i.e. the only knowledge about θ); then the MaxEnt distribution is

\[
p(\theta \mid \hat{\theta}, I) = \frac{\exp(-\theta/\lambda)}{Z(\lambda)}, \tag{2.8}
\]

where λ must be determined such that ⟨θ⟩ = θ̂ and Z(λ) is the partition function (Jaynes, 1968). If the support of θ is 0 ≤ θ < ∞, then eq. (2.8) simplifies, with λ = θ̂ and Z(λ) = θ̂. Along the same lines, when prior knowledge constrains both the mean value and the variance of the distribution, the distribution with maximum entropy is a Gaussian function. Last, when no constraint is applied, the distribution with maximum entropy is a uniform distribution. Flat priors are the least informative ones. A flat prior on a parameter gives the same probability to each value of the parameter within the range of the prior. See, e.g., Jaynes (1968, 2003); O’Hagan (2000); Dose (2003); Gregory (2005) for more details.
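The simplification of eq. (2.8) on the half-line, and the claim that the exponential has the largest entropy for a fixed mean, can be checked numerically. The sketch below is only an illustration under stated assumptions: the point estimate `theta_hat = 2.5` is hypothetical, the infinite support is truncated at 40 times that value, and a gamma distribution with shape 2 serves as an arbitrary rival that has the same mean.

```python
import math

def integrate(f, a, b, n=100000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

theta_hat = 2.5            # hypothetical point estimate (assumption)
lam = theta_hat            # lambda = theta_hat on the support 0 <= theta < infinity
upper = 40.0 * theta_hat   # numerical truncation of the infinite support

# Partition function and mean of the MaxEnt prior of eq. (2.8).
Z = integrate(lambda t: math.exp(-t / lam), 0.0, upper)
prior = lambda t: math.exp(-t / lam) / Z
mean = integrate(lambda t: t * prior(t), 0.0, upper)

def entropy(pdf):
    """Differential entropy -integral(p log p) on the truncated support."""
    def integrand(t):
        p = pdf(t)
        return -p * math.log(p) if p > 0.0 else 0.0
    return integrate(integrand, 0.0, upper)

# Rival with the same mean: gamma distribution with shape 2 and scale theta_hat / 2.
scale = theta_hat / 2.0
rival = lambda t: t * math.exp(-t / scale) / scale**2

print(f"Z = {Z:.6f}, mean = {mean:.6f}")
print(f"entropy of exponential prior: {entropy(prior):.4f}")
print(f"entropy of gamma rival:       {entropy(rival):.4f}")
```

Both Z and the mean come out equal to the point estimate, as stated in the text, and the exponential prior has the larger entropy of the two candidates. A comparison against one rival is of course a spot check rather than a proof of the MaxEnt property.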

The impact of the prior pdf on the posterior pdf can be tested by employing different choices of priors. When the choice of prior pdf does not change the posterior pdf significantly, then the data (i.e. the likelihood function) contain significant information (Kass and Wasserman, 1996).
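Such a prior-sensitivity test can be sketched as follows. The example assumes Poisson-distributed counts with an unknown rate and compares a flat prior with an exponential MaxEnt prior of mean 2; the model, the data sets and the function name `posterior_mean` are hypothetical choices for illustration.

```python
import math

def posterior_mean(counts, prior, grid):
    """Posterior mean of a Poisson rate on a uniform grid for a given prior."""
    n, total = len(counts), sum(counts)
    # Log-posterior up to a constant that does not depend on the rate.
    log_post = [total * math.log(t) - n * t + math.log(prior(t)) for t in grid]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # rescaled to avoid overflow
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

grid = [0.01 * (k + 0.5) for k in range(3000)]  # rates between 0 and 30

flat = lambda t: 1.0                 # flat prior on the grid range
expo = lambda t: math.exp(-t / 2.0)  # MaxEnt prior with hypothetical mean 2

small = [5, 7]                                   # two hypothetical measurements
large = [5, 7, 6, 4, 8, 6, 5, 7, 6, 6] * 5       # fifty hypothetical measurements

shift_small = posterior_mean(small, flat, grid) - posterior_mean(small, expo, grid)
shift_large = posterior_mean(large, flat, grid) - posterior_mean(large, expo, grid)
print(f"shift of posterior mean, 2 data points:  {shift_small:.3f}")
print(f"shift of posterior mean, 50 data points: {shift_large:.3f}")
```

With two measurements the posterior mean moves noticeably when the prior is changed, whereas with fifty measurements the two priors give nearly the same answer, which is the situation in which the likelihood dominates the posterior.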