In this section, priors for the nonparametric function parameters $\beta$ and for the parametric effects $\gamma$ are specified. We assume the independence of prior specifications between separate functions and parametric effects, and between functions and parametric effects of different latent variables; thus we obtain
$$p(\beta, \gamma) = \prod_{r=1}^{q} \prod_{h=1}^{g} p(\beta_{rh}) \cdot \prod_{r=1}^{q} p(\gamma_r).$$
Priors for metric covariates (additive models, varying coefficient models) are based on Gaussian smoothness priors (see Fahrmeir and Tutz, 2001), and priors for spatial covariates (geoadditive models) are based on Markov random fields² (see Besag, 1974; Besag and Kooperberg, 1995). Conveniently, in the Bayesian approach both types of covariates can be treated in a unifying framework involving the use of a penalty matrix $K$.
Nonparametric effects of metric covariates
In this thesis, nonparametric effects for metric covariates can be modeled in three different ways: first-order random walks, second-order random walks, and P-splines. In order to simplify the notation in this section, we drop the indices of most of the variables, hence f
denotes the nonparametric function, β is the vector of function parameters, x represents the metric covariate, and d denotes the dimension of the vector of function parameters.
1. First-order random walk
First we consider the case of a metric covariate $x$ with equally spaced observations $x_{(t)}$, $t = 1, \ldots, d$, $d \le n$. The unique observations $x_{(t)}$ are sorted according to $x_{(1)}, \ldots, x_{(t)}, \ldots, x_{(d)}$, and thus define an equidistant grid on the x-axis. A classic example is the covariate age, ranging from the age of 20 to 80 in a social survey, hence $d = 61$. Let us set $\beta_t := f(x_{(t)})$ and let
$$\beta = (\beta_1, \ldots, \beta_t, \ldots, \beta_d)'$$
denote the vector of function evaluations according to Section 3.2. The first-order random walk is defined as
$$\beta_t = \beta_{t-1} + u_t \quad \text{with} \quad u_t \sim N(0, \kappa^2),$$
$t = 2, \ldots, d$, and a diffuse prior $\beta_1 \propto \text{constant}$. The parameter $\beta_t$ is determined by the previous value $\beta_{t-1}$ plus a normally distributed random error $u_t$ with mean 0 and variance $\kappa^2$, i.e. $\beta_t \mid \beta_{t-1}, \kappa^2 \sim N(\beta_{t-1}, \kappa^2)$. The expected value of $\beta_t$ coincides with the expected value of $\beta_{t-1}$, and consequently this prior specification penalizes value differences between two successive observations. The entire prior distribution of a function $f$ with a vector of function parameters $\beta$ for a first-order random walk follows as
$$p(\beta) = \prod_{t=2}^{d} p(\beta_t \mid \beta_{t-1}, \kappa^2) \propto \exp\left(-\frac{1}{2\kappa^2} \sum_{t=2}^{d} (\beta_t - \beta_{t-1})^2\right) = \exp\left(-\frac{1}{2\kappa^2} \beta' K \beta\right),$$
with the penalty matrix
$$K = \begin{pmatrix}
1 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 1
\end{pmatrix}.$$
² Other spatial modeling methods might be included in the future, e.g. two-dimensional P-splines.
The generalization for non-equidistant observations is straightforward and is detailed in appendix B.
After the parameters have been estimated, the function evaluations of all observations $i$ are given by $X\beta$ with the $(n \times d)$-dimensional design matrix $X$. Each row of $X$ contains the value 1 in the column that corresponds to the covariate value of the respective observation; all other entries in that row are zero. To continue the example from above, we observe ages from 20 to 80 and thus the design matrix $X$ has 61 columns; if observation $i$ has age 40, the $i$-th row of $X$ contains a one in the 21st column, and zeros in all other columns.
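The construction of the penalty matrix and of the indicator design matrix can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used in the thesis: the function names `rw1_penalty` and `indicator_design` are hypothetical, the parameter vector `beta` is arbitrary, and every observed covariate value is assumed to lie exactly on the grid.

```python
import numpy as np

def rw1_penalty(d):
    """Penalty matrix K = D1' D1 of a first-order random walk on d grid points."""
    D1 = np.diff(np.eye(d), axis=0)        # (d-1) x d first-difference matrix
    return D1.T @ D1

def indicator_design(x, grid):
    """n x d design matrix with a single 1 per row, in the column of x_i's grid value.

    Assumes every value in x occurs in the sorted grid."""
    x = np.asarray(x)
    grid = np.asarray(grid)
    cols = np.searchsorted(grid, x)        # 0-based column index of each observation
    X = np.zeros((len(x), len(grid)))
    X[np.arange(len(x)), cols] = 1.0
    return X

ages = np.arange(20, 81)                   # equidistant grid 20, ..., 80, so d = 61
K = rw1_penalty(len(ages))
X = indicator_design([40, 20, 80, 40], ages)

beta = np.sin(ages / 10.0)                 # arbitrary illustrative parameter vector
quad_form = beta @ K @ beta                # beta' K beta
sum_sq_diff = np.sum(np.diff(beta) ** 2)   # sum of squared successive differences
print(np.isclose(quad_form, sum_sq_diff))
```

The final comparison checks numerically that the quadratic form $\beta' K \beta$ equals the sum of squared first differences in the exponent of the prior.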
2. Second-order random walk
Definitions and notation are identical to the case of the first-order random walk. The second-order random walk is defined as
$$\beta_t = 2\beta_{t-1} - \beta_{t-2} + u_t \quad \text{with} \quad u_t \sim N(0, \kappa^2), \tag{5.4}$$
$t = 3, \ldots, d$, and diffuse priors $\beta_1 \propto \text{constant}$ and $\beta_2 \propto \text{constant}$. The parameter $\beta_t$ is determined by the doubled previous value $2\beta_{t-1}$ minus the value $\beta_{t-2}$ plus a normally distributed random error $u_t$ with mean 0 and variance $\kappa^2$, i.e. $\beta_t \mid \beta_{t-1}, \beta_{t-2}, \kappa^2 \sim N(2\beta_{t-1} - \beta_{t-2}, \kappa^2)$. Since the expected value of $\beta_t$ can be interpreted as the extrapolated value of the straight line through $\beta_{t-1}$ and $\beta_{t-2}$, the second-order random walk prior penalizes deviations from the linear trend. Typically, the second-order random walk generates visually smoother functions $f$ than the first-order random walk. The entire prior distribution of a function $f$ with second-order
5.2 Prior distributions 49
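random walk follows as
$$p(\beta) = \prod_{t=3}^{d} p(\beta_t \mid \beta_{t-1}, \beta_{t-2}, \kappa^2) \propto \exp\left(-\frac{1}{2\kappa^2} \sum_{t=3}^{d} (\beta_t - 2\beta_{t-1} + \beta_{t-2})^2\right) = \exp\left(-\frac{1}{2\kappa^2} \beta' K \beta\right),$$
with the penalty matrix
$$K = \begin{pmatrix}
1 & -2 & 1 & & & & \\
-2 & 5 & -4 & 1 & & & \\
1 & -4 & 6 & -4 & 1 & & \\
 & \ddots & \ddots & \ddots & \ddots & \ddots & \\
 & & 1 & -4 & 6 & -4 & 1 \\
 & & & 1 & -4 & 5 & -2 \\
 & & & & 1 & -2 & 1
\end{pmatrix}.$$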
The generalization for non-equidistant observations is also detailed in appendix B. The design matrix X for second-order random walks is constructed in the same way as the design matrix of first-order random walks.
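As a rough check of the second-order penalty matrix, it can be built as $K = D_2' D_2$ from a second-difference matrix $D_2$. The sketch below assumes an equidistant grid; the name `rw2_penalty` and the dimension $d = 8$ are chosen purely for illustration.

```python
import numpy as np

def rw2_penalty(d):
    """Penalty matrix K = D2' D2 of a second-order random walk on d grid points."""
    D2 = np.diff(np.eye(d), n=2, axis=0)   # (d-2) x d matrix with rows (1, -2, 1)
    return D2.T @ D2

K2 = rw2_penalty(8)
print(K2.astype(int))                      # banded structure of the penalty matrix

# A straight line has zero second differences and is therefore not penalized.
line = 3.0 + 2.0 * np.arange(8)
line_penalty = line @ K2 @ line
```

The zero penalty for a straight line reflects the statement that this prior penalizes deviations from the linear trend only.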
3. P-splines
Both random walks require the estimation of one parameter for each unique observation $x_{(t)}$, leading to a high number of parameters and occasionally to an overfit of the data. For that reason, other ways of estimating smooth functions exist in the literature, see Fahrmeir and Tutz (2001). One method which has become quite popular in recent years for smoothing in semiparametric models is the approach of penalized splines (P-splines). In the statistical community, essentially two different types of P-splines are discussed. The first approach stems from the work of Eilers and Marx (1996), who promote the use of a B-spline basis, equally spaced knots and difference penalties, whereas Ruppert, Wand and Carroll (2003) employ truncated power functions, knots based on quantiles of the covariate of interest $x$, and a ridge penalty. Eilers and Marx (2004) compared the two approaches along several dimensions (e.g. numerical stability, quality of fit) and concluded that the first approach is to be preferred. For that reason, we restrict our discussion to B-splines with difference penalties and equally spaced knots. The literature mentioned above covers the frequentist treatment of P-splines, whereas our Bayesian approach is based on the work of Lang and Brezger (2004) and Brezger and Lang (2005), who give a detailed account of Bayesian P-splines in various settings.
The unknown function $f$ of a metric covariate $x$ is approximated by a polynomial spline of degree $D$, defined on a set of equally spaced knots $x_{\min} = \rho_0 < \rho_1 < \ldots < \rho_{I-1} < \rho_I = x_{\max}$ with $I$ intervals. This polynomial spline is constructed by a linear combination of $d = D + I$ B-spline basis functions $B_c$ in the following way:
$$f(x) = \sum_{c=1}^{d} \beta_c B_c(x).$$
[Figure 5.1: Illustration of B-spline basis functions with D = 1 (top, B1 to B5), D = 2 (middle, B1 to B6) and D = 3 (bottom, B1 to B7) for I = 4 intervals and I + 1 = 5 knots ρ0, ..., ρ4.]
The vector of function parameters now contains the regression coefficients or weights of the individual B-spline basis functions, i.e. $\beta = (\beta_1, \beta_2, \ldots, \beta_{D+I})'$. Note that $\beta$ does not contain function evaluations as is the case for random walk models. A B-spline basis function has the following characteristic properties:
A B-spline of degree $D$ consists of $D + 1$ polynomial pieces connected at $D$ inner knots;
Continuous derivatives up to order $D - 1$ exist at the knots;
A B-spline covers $D + 2$ knots or $D + 1$ regions between knots, and overlaps with $2D$ adjacent B-splines;
At each point on the covariate axis (apart from the knots), $D + 1$ B-splines have a non-zero value.
More information about B-splines can be found in the literature mentioned above. Figure 5.1 shows examples of B-spline basis functions for the degrees $D = 1, 2, 3$, and $I = 4$ intervals with $I + 1 = 5$ knots.
The design matrix $X$ for P-splines is more intricate than in the case of random walk priors. Each row $i$ of $X$ contains the values of the B-spline basis functions evaluated at $x_i$, hence $X_{ic} = B_c(x_i)$. In accordance with the fourth property of B-splines, each row in $X$ has $D + 1$ non-zero values. Thus the vector of function evaluations for all observations $i$ is given by $X\beta$.
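To make the structure of this design matrix concrete, the following sketch evaluates the $d = D + I$ basis functions on equally spaced knots with the standard Cox-de Boor recursion. It is a minimal illustration under assumptions, not the routine used for the analyses: the function name `bspline_design_matrix` and its arguments are hypothetical, and the age grid from 20 to 80 is borrowed from the earlier example.

```python
import numpy as np

def bspline_design_matrix(x, x_min, x_max, n_intervals, degree):
    """Evaluate the d = degree + n_intervals B-spline basis functions at x.

    Uses equally spaced knots on [x_min, x_max], extended by `degree`
    additional knots on each side, and the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    h = (x_max - x_min) / n_intervals
    knots = x_min + h * np.arange(-degree, n_intervals + degree + 1)
    # Degree 0: indicator of the half-open interval [knots[j], knots[j+1]).
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    # Assign x == x_max to the last inner interval so the right boundary is covered.
    at_max = x == x_max
    B[at_max, :] = 0.0
    B[at_max, degree + n_intervals - 1] = 1.0
    # Raise the degree step by step; with equidistant knots the span is k * h.
    for k in range(1, degree + 1):
        left = (x[:, None] - knots[None, :-(k + 1)]) / (k * h)
        right = (knots[None, k + 1:] - x[:, None]) / (k * h)
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

x = np.linspace(20.0, 80.0, 61)
X = bspline_design_matrix(x, 20.0, 80.0, n_intervals=10, degree=3)
print(X.shape)                             # (n, D + I)
print(np.count_nonzero(X, axis=1).max())   # at most D + 1 non-zero entries per row
```

Observations that fall exactly on a knot have only $D$ non-zero basis functions, which is why the fourth property above excludes the knots.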
A crucial question is the determination of the number of knots. The number of knots should be high enough to adapt to the underlying function $f$; however, it should not be so high that the data are overfitted. Eilers and Marx recommend a number of knots between 20 and 40, and introduce a penalization of the differences between regression coefficients of adjacent B-spline basis functions in order to generate a smoothing effect. Hence the smoothness of the function $f$ is achieved by penalizing overly large differences between the coefficients of adjacent B-splines, and not by altering the number of knots. In a Bayesian approach, this penalization is incorporated conveniently by applying a random walk prior to the B-spline regression coefficients $\beta$. In our analyses, we typically choose B-splines of degree $D = 3$ with $I = 10$ intervals, and a second-order random walk prior on the B-spline regression coefficients.
Nonparametric effects of interactions (VCM)
The models discussed so far are not suitable for modeling interactions. As introduced in Section 3.2, in a VCM the function is of the form
$$f(x_i) = f(\tilde{x}_i, v_i) = g(\tilde{x}_i)\, v_i,$$
where the effect modifiers $\tilde{x}_i$ are continuous covariates, and the interacting variables $v_i$ are metric or categorical. We restrict our model to categorical interacting variables. Since the differences between two categories of an ordinal or categorical variable are not interpretable, we apply dummy coding for $v$ (see Fahrmeir and Tutz, 2001). Let us assume that $v$ has $K$ categories; then we define
$$v_i^{(k)} = \begin{cases} 1, & \text{if observation } i \text{ falls into category } k, \\ 0, & \text{else,} \end{cases} \qquad k = 1, \ldots, K.$$
The dummy coding implies the estimation of $K$ different functions $f^{(k)}$ with function parameter values $\beta^{(k)}$, so that the total part of the predictor for the function $f$ results in
$$f = f^{(1)} + \ldots + f^{(K)} = X^* \beta^{(1)} + \mathrm{diag}(v_1^{(2)}, \ldots, v_n^{(2)})\, X^* \beta^{(2)} + \ldots + \mathrm{diag}(v_1^{(K)}, \ldots, v_n^{(K)})\, X^* \beta^{(K)}.$$
Here the reference category was set to category 1, but arbitrary reference categories are possible. The design matrix $X^*$ is the usual design matrix associated with the continuous function $g(\tilde{x})$, which can be modeled by either a random walk prior or a P-spline prior. For example, we could identify two separate nonparametric functions of the continuous covariate age for the interacting variable gender with two categories.
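The predictor above can be written as one product of a stacked design matrix and the concatenated parameter vectors. The sketch below shows this for a toy example; the function name `vcm_design`, the small matrix `X_star` and all numerical values are hypothetical and only serve to illustrate the block structure with category 1 as reference.

```python
import numpy as np

def vcm_design(X_star, v, n_categories):
    """Stack [X*, diag(v^(2)) X*, ..., diag(v^(K)) X*] column-wise.

    Category 1 is the reference category; v holds category labels 1, ..., K."""
    v = np.asarray(v)
    blocks = [X_star]
    for k in range(2, n_categories + 1):
        dummy = (v == k).astype(float)           # dummy variable v_i^(k)
        blocks.append(dummy[:, None] * X_star)   # equals diag(dummy) @ X_star
    return np.hstack(blocks)

# Toy example: 4 observations, 3 function parameters, 2 categories.
X_star = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
v = np.array([1, 2, 2, 1])
beta_1 = np.array([0.5, 1.0, 1.5])               # parameters of f^(1)
beta_2 = np.array([0.1, 0.2, 0.3])               # parameters of f^(2)

X_full = vcm_design(X_star, v, n_categories=2)
f = X_full @ np.concatenate([beta_1, beta_2])
f_direct = X_star @ beta_1 + np.diag((v == 2).astype(float)) @ X_star @ beta_2
```

Observations in the reference category receive only $f^{(1)}$, while observations in category 2 receive $f^{(1)} + f^{(2)}$, so $\beta^{(2)}$ describes the deviation from the reference function.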
Nonparametric effects of spatial covariates
In this section we discuss the prior distribution of spatial covariates. Let us assume that the covariate $x_i$ denotes the region $c$ of observation $i$, and that the vector of function evaluations $\beta = (\beta_1, \beta_2, \ldots, \beta_d)'$ contains the estimates of the $d$ different regions. The spatial function evaluations of all observations $i$ can be written as $X\beta$ with the $(n \times d)$-dimensional design matrix $X$, where $X_{ic} = 1$ if observation $i$ is associated with region $c$; all other values of row $i$ equal zero. It can be quite useful to include a spatial covariate in an analysis in order to examine the geographical variation of the latent variables. Of course, the region itself typically has no direct effect on the values of the latent variables, but there are certain underlying characteristics of each region which could readily influence their values. For example, the latent variable "satisfaction with living conditions" surely depends on the existence of heavy industry polluting the air or on the local unemployment rate, both of which vary across regions. The basic assumption is that adjacent regions should have a similar impact on latent values, while two regions far apart from each other do not exhibit such a similarity. In order to make a prior specification, the full neighborhood structure for each region has to be known. In our context, two regions are considered neighbors when they share a common boundary. Other definitions of neighborhood are described by Besag et al. (1991). We apply the following spatial smoothness prior to the function evaluations $\beta_c$ ($c = 1, \ldots, d$) for all $d$ regions:
$$\beta_c \mid \beta_e, e \neq c, \kappa^2 \sim N\left(\sum_{e \in \partial_c} \frac{\beta_e}{N_c}, \; \frac{\kappa^2}{N_c}\right), \tag{5.5}$$
where $N_c$ indicates the number of adjacent sites of region $c$, and $e \in \partial_c$ denotes all regions $e$ that are neighbors of region $c$. Hence the conditional mean of $\beta_c$ is an unweighted average of the function values of all adjacent regions. Since spatial data, e.g. regions, do not exhibit a natural ordering, a symmetric conditioning is applied. A more general prior including Equation (5.5) as a special case is given by
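$$\beta_c \mid \beta_e, e \neq c, \kappa^2 \sim N\left(\sum_{e \in \partial_c} \frac{w_{ce}}{w_{c+}} \beta_e, \; \frac{\kappa^2}{w_{c+}}\right),$$
where the weights $w_{ce}$ are not necessarily equal, and $w_{c+} = \sum_{e \in \partial_c} w_{ce}$. For example, the weights could depend on the length of the border or the distance between two adjacent regions. In our analyses, the specialised prior of Equation (5.5) is always used. The entire prior distribution follows as
$$p(\beta) \propto \exp\left(-\sum_{c=1}^{d} \frac{w_{c+}}{2\kappa^2} \left(\beta_c - \sum_{e \in \partial_c} \frac{w_{ce}}{w_{c+}} \beta_e\right)^2\right) = \exp\left(-\frac{1}{2\kappa^2} \beta' K \beta\right),$$
with the $d$-dimensional penalty matrix $K$ whose entries are
$$k_{cc} = w_{c+} \quad \text{and} \quad k_{ce} = \begin{cases} -w_{ce}, & e \in \partial_c, \\ 0, & \text{otherwise.} \end{cases}$$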
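For the unweighted case of Equation (5.5), where $w_{ce} = 1$ and $w_{c+} = N_c$, the penalty matrix can be assembled directly from a neighborhood list. The following sketch uses a made-up map of four regions; the function name `spatial_penalty`, the dictionary `neighbors` and the values in `beta` are assumptions for illustration only.

```python
import numpy as np

def spatial_penalty(neighbors, d):
    """Penalty matrix with k_cc = N_c and k_ce = -1 for neighboring regions c, e."""
    K = np.zeros((d, d))
    for c, adjacent in neighbors.items():
        K[c, c] = len(adjacent)            # number of neighbors N_c
        for e in adjacent:
            K[c, e] = -1.0                 # unit weight w_ce = 1
    return K

# Hypothetical map: region 1 borders all others, regions 2 and 3 border each other.
neighbors = {0: [1], 1: [0, 2, 3], 2: [1, 3], 3: [1, 2]}
K = spatial_penalty(neighbors, d=4)

beta = np.array([0.2, 1.0, 0.4, 0.9])      # illustrative regional effects
c = 1
# Conditional mean of beta_c: unweighted average over the neighbors of region c.
cond_mean = np.mean(beta[neighbors[c]])
cond_mean_from_K = -(K[c] @ beta - K[c, c] * beta[c]) / K[c, c]
```

Each row of $K$ sums to zero, so adding a constant to all regional effects leaves the quadratic form $\beta' K \beta$ unchanged.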
To conclude, we want to emphasize that the priors of all nonparametric effects (metric, spatial, and interaction) can be modeled in a unifying framework with $p(\beta) \propto \exp\left(-\frac{1}{2\kappa^2} \beta' K \beta\right)$.
Hyperpriors of variances κ2 of nonparametric effects
We have defined all priors for nonparametric functions conditional on the variance $\kappa^2$, i.e. $p(\beta) = p(\beta \mid \kappa^2)\, p(\kappa^2)$. The variance $\kappa^2$ determines the smoothness of the resulting function $f$, and is therefore called the smoothing parameter. It is automatically estimated in our Bayesian approach. To complete the prior specification for nonparametric effects, we define the prior of the hyperparameter $\kappa^2$ to be
$$p(\kappa^2) \propto \frac{1}{(\kappa^2)^{a+1}} \exp\left(-\frac{b}{\kappa^2}\right),$$
where $a \in \mathbb{R}$ and $b > 0$. If $a > 0$, this expression corresponds to an inverse Gamma distribution IG($a$, $b$). The parameters $a$ and $b$ have to be chosen appropriately. Common choices include $a = b = 0.001$, leading to an almost noninformative prior for $\kappa^2$, or $a = 1$ and $b$ equal to a very small value, e.g. $b = 0.005$ as proposed by Besag et al. (1995). The choice of such highly vague but proper priors prevents problems associated with noninformative priors, such as the nonconvergence of the Gibbs sampler (see Hobert and Casella, 1996). If a noninformative and improper prior is used for the variances $\kappa^2$, the resulting posterior can be improper, which is not necessarily indicated by the sampling chains of the Gibbs sampler. In such a case the Gibbs sampler would yield random draws from a nonexistent posterior distribution. All simulation studies and data analyses in this thesis use a vague, but informative prior setting with $a = b = 0.001$.
However, Sun, Tsutakawa and He (2001) pointed out that the posterior can be proper when improper diffuse priors for the variance component $\kappa^2$ are chosen, provided certain conditions hold. There are two types of improper priors that can be employed. The first improper prior is obtained for $a = -1$, $b = 0$, which yields $p(\kappa^2) \propto 1$; the parameters $a = -0.5$, $b = 0$ lead to the second improper prior $p(\kappa^2) \propto (\kappa^2)^{-1/2}$. Since some statisticians argue that the use of highly diffuse but proper priors influences the parameter estimates in a significant way, we check the effect of these two improper priors in Sections 6.3.2 and 6.3.3, where smooth functions of metric and spatial covariates are estimated in simulation studies.
Parametric effects
The conjugate prior distribution of the vector of regression coefficients $\gamma_r$ is an $m$-dimensional multivariate normal density with the mean $\gamma_r^*$ and the precision matrix $\Gamma_r^*$, which are chosen by the researcher according to the available prior information about the parameters, i.e.
$$\gamma_r \sim N(\gamma_r^*, \Gamma_r^{*-1}).$$
In our analyses, we always choose noninformative priors for all regression parameters $\gamma_r$, hence all values of $\Gamma_r^*$ are set to zero.
Full prior distribution of the structural equation model
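Finally, the full specification of prior distributions for the structural equation is given by
$$p(\beta, \gamma) = \prod_{r=1}^{q} \prod_{h=1}^{g} p(\beta_{rh}) \cdot \prod_{r=1}^{q} p(\gamma_r) \propto \prod_{r=1}^{q} \prod_{h=1}^{g} \exp\left(-\frac{1}{2\kappa_{rh}^2} \beta_{rh}' K_{rh} \beta_{rh}\right) p(\kappa_{rh}^2) \cdot \prod_{r=1}^{q} p(\gamma_r),$$
where the penalty matrices $K_{rh}$, smoothing parameters $\kappa_{rh}^2$, and parametric effects $\gamma_r$ are defined in the preceding paragraphs.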
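On the log scale, the product above turns into a sum over all nonparametric terms. The sketch below evaluates this sum exactly as the expression is written here, up to an additive constant, and assumes flat priors for $\gamma_r$ so that they contribute nothing. The function names `log_hyperprior` and `log_prior` and the toy inputs are hypothetical.

```python
import numpy as np

def log_hyperprior(kappa2, a=0.001, b=0.001):
    """Log of p(kappa^2) proportional to (kappa^2)^(-(a+1)) * exp(-b / kappa^2)."""
    return -(a + 1.0) * np.log(kappa2) - b / kappa2

def log_prior(betas, Ks, kappa2s, a=0.001, b=0.001):
    """Sum of log smoothness priors and log hyperpriors over all function terms.

    Flat priors on the parametric effects are assumed, so gamma adds only a constant."""
    total = 0.0
    for beta, K, kappa2 in zip(betas, Ks, kappa2s):
        total += -0.5 * (beta @ K @ beta) / kappa2   # -beta' K beta / (2 kappa^2)
        total += log_hyperprior(kappa2, a, b)
    return total

# Toy term: first-order random walk penalty on d = 3 parameters.
K = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
beta = np.array([0.0, 1.0, 3.0])

vague = log_prior([beta], [K], [2.0])                # a = b = 0.001
flat = log_prior([beta], [K], [2.0], a=-1.0, b=0.0)  # improper prior p(kappa^2) = const
```

With $a = -1$ and $b = 0$ the hyperprior term vanishes, so `flat` reduces to the smoothness penalty alone.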