$$f(X) = f_1(X_{G_{11}}) + f_2(X_{G_{21}}) + \cdots + f_K(X_{G_{K1}})$$

This underlying relationship allows more than one predictor to be causal, but insists that the causal predictors contribute independently and additively. When we introduce more than one function, it may be necessary to safeguard against unidentifiability. For example, if we have two functions, each of the form $f_k(X_{ig}) = \theta_{k0} + \theta_{k1} X_{ig}$, it is possible to alter $\theta_{10}$ and $\theta_{20}$ without changing $f_1 + f_2$. The easiest solution to this problem is to merge each $\theta_{k0}$ into a global intercept $\theta_0$. In this case, the degrees of freedom of the model reduce from $2K$ to $1 + K$. Similarly, if using functions of the form $f_k(X_{ig}) = \theta_{kd}$, we can introduce a global intercept term $\theta_0$, which acts as a base value, and assign $\theta_{k1} = 0$ for $k = 1, 2, \ldots, K$. Again, the total degrees of freedom will be reduced by $K - 1$.
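The identifiability problem can be checked numerically. The following is a minimal sketch for K = 2 linear functions; the predictor values, coefficient values and the shift `c` are arbitrary illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=5)   # values of the first causal predictor (illustrative)
x2 = rng.normal(size=5)   # values of the second causal predictor (illustrative)

def additive_fit(theta10, theta11, theta20, theta21):
    """Return f1 + f2, where f_k(x) = theta_k0 + theta_k1 * x."""
    return (theta10 + theta11 * x1) + (theta20 + theta21 * x2)

# Moving an amount c from one intercept to the other leaves f1 + f2 unchanged,
# so the two intercepts cannot be identified separately.
c = 3.7
original = additive_fit(1.0, 0.5, 2.0, -0.3)
shifted = additive_fit(1.0 + c, 0.5, 2.0 - c, -0.3)
unidentifiable = np.allclose(original, shifted)

# Merging the intercepts into theta0 = theta10 + theta20 gives the same fit
# with 1 + K = 3 parameters instead of 2K = 4.
theta0 = 1.0 + 2.0
merged = theta0 + 0.5 * x1 + (-0.3) * x2
same_fit = np.allclose(original, merged)
```

Both `unidentifiable` and `same_fit` evaluate to `True`: only the sum of the intercepts affects the fit, which is why a single global intercept suffices.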

For categorical predictors, each $f_k$, and therefore $\sum_k f_k$, can be represented by a linear model. An all-inclusive approach is to let $K = N$ and write the underlying relationship as $f(X) = J\Theta$, where $J = [\mathbf{1}\; J_1\; J_2\; \cdots\; J_N]$ and $\Theta = [\theta_0, \theta_1, \theta_2, \ldots, \theta_N]$. In this model, $\theta_g$ represents the coefficients specific to predictor $g$. In most cases, the degrees of freedom of this function will far exceed $n$. Therefore, it becomes necessary to encourage most $\theta_g$ to have zero (or negligible) magnitude, indicating that predictor $g$ does not (significantly) contribute.
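To make the size of this model concrete, the sketch below builds one possible J. It assumes that each block J_g holds indicator columns for the non-baseline categories of predictor g (the first category being absorbed by the global intercept); the function names, the three-level coding and the sizes n = 20, N = 50 are hypothetical.

```python
import numpy as np

def indicator_block(x, n_levels):
    """Indicator columns for levels 1..n_levels-1; level 0 is the baseline."""
    x = np.asarray(x)
    return (x[:, None] == np.arange(1, n_levels)[None, :]).astype(float)

def design_matrix(X, n_levels):
    """Build J = [1 J_1 ... J_N] from an n-by-N array of category codes."""
    n, N = X.shape
    blocks = [np.ones((n, 1))]  # global intercept column
    blocks += [indicator_block(X[:, g], n_levels) for g in range(N)]
    return np.hstack(blocks)

rng = np.random.default_rng(1)
n, N, n_levels = 20, 50, 3      # hypothetical sizes
X = rng.integers(0, n_levels, size=(n, N))
J = design_matrix(X, n_levels)
n_coefficients = J.shape[1]     # 1 + N * (n_levels - 1)
```

Here J has 101 columns but only 20 rows, so the least squares problem is underdetermined and some form of sparsity penalty or prior is needed.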

In the frequentist set-up, there are many flavours of penalty term which will have the desired effect. Perhaps the simplest of these is “variable subset selection”, which enforces a penalty based only on the number of non-zero regression coefficients. An example penalty term is the Akaike Information Criterion (Akaike, 1974), which simply increases the residual sum of squares by an amount proportional to the number of non-zero coefficients. Variable subset selection generally uses a stepwise search of the model space. Each model dictates which regression coefficients are non-zero, conditional on which the best fit can be calculated using the least squares estimates.
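A forward stepwise search of this kind can be sketched as follows. The score is the residual sum of squares plus 2·sigma2 per included predictor, which is the AIC form when the noise variance `sigma2` is treated as known; that assumption, the function names and the simulated data are all illustrative.

```python
import numpy as np

def rss(X, y, subset):
    """Residual sum of squares of the least squares fit on the chosen columns."""
    if not subset:
        return float(np.sum((y - y.mean()) ** 2))  # intercept-only model
    A = np.column_stack([np.ones(len(y)), X[:, sorted(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def penalised_score(X, y, subset, sigma2):
    # RSS plus a penalty proportional to the number of non-zero coefficients.
    return rss(X, y, subset) + 2.0 * sigma2 * len(subset)

def forward_stepwise(X, y, sigma2):
    """Greedily add the predictor that most improves the score, until none does."""
    selected = set()
    best = penalised_score(X, y, selected, sigma2)
    while True:
        candidates = [g for g in range(X.shape[1]) if g not in selected]
        if not candidates:
            break
        scores = {g: penalised_score(X, y, selected | {g}, sigma2)
                  for g in candidates}
        g_best = min(scores, key=scores.get)
        if scores[g_best] >= best:
            break
        selected.add(g_best)
        best = scores[g_best]
    return sorted(selected)

rng = np.random.default_rng(2)
n, N = 100, 10
X = rng.normal(size=(n, N))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(size=n)  # predictors 2 and 7 are causal
chosen = forward_stepwise(X, y, sigma2=1.0)
```

The two simulated causal predictors are recovered; a few spurious predictors may also enter, since this penalty is fairly lenient.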

“Ridge regression” is a description given to methods which penalise based on the sum of the squares of regression coefficients (e.g. Zhang and Xu, 2005; Park and Hastie, 2008). By contrast, the LASSO method penalises according to the sum of their absolute values (Tibshirani, 1996). Generally, the penalty term is prefaced by a scale factor λ, so that as λ → 0 the solution approaches the least squares estimates. Ridge regression and the LASSO can be compared by considering their effect on the best fit as λ is increased from zero. For ridge regression, the least squares estimates of the regression coefficients are reduced in a continuous fashion, only reaching zero when λ = ∞. For the LASSO, the estimates reach zero at different points, depending on the predictors’ relative contributions to f (X). This highlights the differences between each method’s sparsity assumption. The former supposes that there are a few strong associations, while most predictors contribute only slightly; the latter supposes most predictors contribute in no way at all.
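The contrast between the two shrinkage paths has a closed form in one special case. The sketch below assumes orthonormal predictors and the objective "residual sum of squares plus λ times the penalty"; under that assumption ridge rescales each least squares estimate by 1/(1 + λ), while the LASSO soft-thresholds it at λ/2. The starting estimates are arbitrary.

```python
import numpy as np

def ridge_path(beta_ls, lam):
    """Ridge estimate for orthonormal predictors: proportional shrinkage."""
    return beta_ls / (1.0 + lam)

def lasso_path(beta_ls, lam):
    """LASSO estimate for orthonormal predictors: soft-thresholding at lam / 2."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam / 2.0, 0.0)

beta_ls = np.array([3.0, -1.0, 0.2])   # illustrative least squares estimates
for lam in [0.0, 0.5, 2.0, 6.0]:
    print(lam, ridge_path(beta_ls, lam), lasso_path(beta_ls, lam))
```

As λ grows, every ridge coefficient shrinks but stays non-zero, whereas the LASSO coefficients hit exactly zero one at a time, weakest first.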

Many frequentist methods have Bayesian analogues. For example, variable subset selection equates to placing a point mass on elements of Θ (e.g. Kuo and Mallick, 1998), ridge regression corresponds to a normal prior (e.g. Zhang et al., 2005; Wang et al., 2005), while the LASSO relates to a double-exponential distribution (e.g. Yi and Xu, 2008; Hoggart et al., 2008).
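The correspondence for ridge regression and the LASSO can be seen from the posterior mode. The derivation below is a sketch under assumptions not stated in the text: Gaussian noise with known variance $\sigma_\varepsilon^2$, scalar coefficients $\theta_g$ with independent priors, a flat prior on $\theta_0$, and prior scale parameters $\tau^2$ and $b$ introduced here for illustration.

```latex
\[
\begin{aligned}
\hat{\Theta}_{\mathrm{MAP}}
  &= \arg\max_{\Theta}\; \Big[ \log p(y \mid \Theta)
     + \sum_{g=1}^{N} \log p(\theta_g) \Big],
  \qquad y \mid \Theta \sim N(J\Theta,\ \sigma_\varepsilon^2 I), \\[4pt]
\theta_g \sim N(0, \tau^2):\quad
\hat{\Theta}_{\mathrm{MAP}}
  &= \arg\min_{\Theta}\; \lVert y - J\Theta \rVert^2
     + \frac{\sigma_\varepsilon^2}{\tau^2} \sum_{g=1}^{N} \theta_g^2, \\[4pt]
p(\theta_g) = \tfrac{1}{2b}\, e^{-\lvert \theta_g \rvert / b}:\quad
\hat{\Theta}_{\mathrm{MAP}}
  &= \arg\min_{\Theta}\; \lVert y - J\Theta \rVert^2
     + \frac{2\sigma_\varepsilon^2}{b} \sum_{g=1}^{N} \lvert \theta_g \rvert .
\end{aligned}
\]
```

In both cases the penalty scale λ is a ratio of noise variance to prior spread, so a tighter prior corresponds to a heavier penalty.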

The use of mixture priors allows more complicated methods to be devised. Shotgun Stochastic Search (SSS; Hans et al., 2007) is one of these. Given a prior probability of association $p \in (0, 1)$, it assigns the regression coefficient corresponding to the $g$th predictor the spike and slab prior distribution

$$P(\theta_g) = (1 - p)\,\delta_{\{0\}} + p\,N(0, \sigma^2),$$

where $\delta_{\{0\}}$ represents a point mass function at 0 which “integrates” to 1, and $N(0, \sigma^2)$ denotes a normal distribution with mean 0 and variance $\sigma^2$. SSS searches the model space in a stepwise fashion, at each step deciding whether to add in, swap out or remove a contributing predictor. The method calculates the posterior scores for all models within the “neighbourhood” of the current state; those models reachable by a single move of the type add, swap or remove. Based upon these scores, SSS constructs a proposal distribution from which it picks which model to move to next.
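A single search step of this kind might look as follows. This is a simplified sketch, not the implementation of Hans et al.: it assumes Gaussian noise with known variance `noise_var`, no intercept, scalar coefficients with the slab variance `slab_var`, and it samples the next model from the whole neighbourhood in proportion to its score. All function and parameter names are hypothetical.

```python
import numpy as np

def log_score(X, y, model, p, slab_var, noise_var):
    """Unnormalised log posterior of a model (a set of included predictors).

    Assumes y ~ N(X_m theta, noise_var * I) with theta_g ~ N(0, slab_var), so
    that marginally y ~ N(0, noise_var * I + slab_var * X_m X_m^T).
    """
    n, N = X.shape
    cols = sorted(model)
    cov = noise_var * np.eye(n)
    if cols:
        Xm = X[:, cols]
        cov += slab_var * Xm @ Xm.T
    _, logdet = np.linalg.slogdet(cov)
    log_marginal = -0.5 * (n * np.log(2 * np.pi) + logdet
                           + y @ np.linalg.solve(cov, y))
    log_prior = len(cols) * np.log(p) + (N - len(cols)) * np.log1p(-p)
    return log_marginal + log_prior

def neighbourhood(model, N):
    """All models reachable by one add, remove or swap move."""
    inside, outside = set(model), set(range(N)) - set(model)
    adds = [frozenset(inside | {g}) for g in outside]
    removes = [frozenset(inside - {g}) for g in inside]
    swaps = [frozenset((inside - {g}) | {h}) for g in inside for h in outside]
    return adds + removes + swaps

def sss_step(X, y, model, rng, p=0.1, slab_var=1.0, noise_var=1.0):
    """Score every neighbour, then sample the next model in proportion to its score."""
    nbrs = neighbourhood(model, X.shape[1])
    logs = np.array([log_score(X, y, m, p, slab_var, noise_var) for m in nbrs])
    weights = np.exp(logs - logs.max())   # subtract the max for numerical stability
    idx = rng.choice(len(nbrs), p=weights / weights.sum())
    return nbrs[idx], dict(zip(nbrs, logs))

rng = np.random.default_rng(3)
n, N = 60, 8
X = rng.normal(size=(n, N))
y = 2.0 * X[:, 1] + rng.normal(size=n)    # only predictor 1 is causal
model, visited = frozenset(), {}
for _ in range(20):
    model, scored = sss_step(X, y, model, rng)
    visited.update(scored)                # keep every scored model, not just the path
best = max(visited, key=visited.get)
```

Because every neighbour is scored at each step, the search accumulates scores for many more models than it actually visits, which is what makes the list of top scoring models useful afterwards.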

SSS keeps track of the top scoring models it explores, from which it estimates posterior probabilities of association for each predictor. The accuracy of these estimates depends on the extent to which the model search succeeds in identifying the best models. In essence, SSS tries to approximate the complete space of models by its list of top scoring models, so the greater the proportion of posterior weight contained within this list, the more accurate the approximation will be. As Hans et al. discuss, rather than automatically accepting the proposed move at each step, they could instead adopt a conventional MCMC strategy and calculate an acceptance probability. The method could then calculate posterior estimates in the normal fashion, based on how often each predictor is included in the Markov chain. The authors conclude, however, that their search is preferable.
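The step from a list of scored models to per-predictor probabilities can be sketched as below. The posterior weights are renormalised over the retained models only, which is the approximation described above; the function name and the example log scores are made up for illustration.

```python
import numpy as np

def inclusion_probabilities(log_scores, N):
    """Approximate posterior probability of association for each of N predictors.

    log_scores maps each retained model (a frozenset of predictor indices)
    to its unnormalised log posterior score.
    """
    models = list(log_scores)
    logs = np.array([log_scores[m] for m in models])
    weights = np.exp(logs - logs.max())
    weights /= weights.sum()              # renormalise over the retained models only
    probs = np.zeros(N)
    for m, w in zip(models, weights):
        for g in m:
            probs[g] += w                 # add the model's weight to each member
    return probs

# Hypothetical list of top scoring models and their log scores.
top_models = {
    frozenset({1}): -100.0,
    frozenset({1, 4}): -101.0,
    frozenset({4}): -110.0,
    frozenset(): -120.0,
}
probs = inclusion_probabilities(top_models, N=6)
```

Predictor 1 appears in the two dominant models and so receives a probability close to 1, while predictor 4 receives roughly 0.27. Any posterior weight lying on models outside the list is ignored, which is the source of the approximation error.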

For quantitative predictors, this category of underlying relationship takes the form of the generalized additive model (Hastie and Tibshirani, 1990), with a Bayesian version discussed in Ravikumar (2009). As with functional data analysis, these methods are suited to settings with very few predictors and where prediction, rather than variable selection, is the main focus.