6. EFFECT OF ECA ADMINISTRATION ON THE MODIFICATIONS
6.3. Glycine
MARS, introduced by Friedman (1991), stands for multivariate adaptive regression splines. It is an adaptive regression procedure that is well suited to high-dimensional problems. A fuller presentation is given in Hastie et al. (2010), for instance; only the key features are outlined below.
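MARS uses expansions in basis functions of the form $(x - t)_+$ and $(t - x)_+$, where
\[
(x - t)_+ = \begin{cases} x - t, & \text{if } x \ge t, \\ 0, & \text{otherwise,} \end{cases}
\qquad \text{and} \qquad
(t - x)_+ = \begin{cases} t - x, & \text{if } x \le t, \\ 0, & \text{otherwise.} \end{cases}
\tag{3.7}
\]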
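As an illustration of (3.7), the following minimal sketch evaluates a reflected pair with NumPy. The function name `reflected_pair` and the sample values are assumptions made for the example only.

```python
import numpy as np

def reflected_pair(x, t):
    """Return ((x - t)_+, (t - x)_+) evaluated elementwise, for a knot at t."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

# Hypothetical sample points and knot, chosen only for illustration.
x = np.array([0.0, 0.5, 1.0, 2.0])
pos, neg = reflected_pair(x, t=1.0)

# Each function is zero on one side of the knot and linear on the other,
# so their sum is |x - t| and their difference is x - t.
print(pos)
print(neg)
```

Exactly one member of the pair is active on each side of the knot, which is what lets a fitted model change slope at $t$.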
So, each basis function is piecewise linear and is said to have a knot at the value $t$; such functions are in fact linear splines. The two functions in (3.7) are called a reflected pair. Let $Z = (X, Y)$ be our training data set, where $Y$ is a vector of $N$ responses and $X$ is an $N \times P$ matrix of predictors. The basic idea is to form reflected pairs for each predictor $X_j$ with a knot at each observed value $x_{ij}$ of that predictor. Thus,
\[
\mathcal{C} = \left\{ (X_j - t)_+,\ (t - X_j)_+ \right\}_{t \in \{x_{1j}, x_{2j}, \ldots, x_{Nj}\},\ j = 1, 2, \ldots, P}
\]
forms our collection of basis functions.
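The collection of candidate basis functions can be enumerated by listing its (predictor, knot) pairs. The sketch below is one possible way to do so; the name `candidate_knots` and the toy matrix are assumptions, and duplicate observed values are merged with `np.unique`, which is a convenience choice rather than part of the definition.

```python
import numpy as np

def candidate_knots(X):
    """List the (j, t) pairs: one reflected pair per predictor j and observed value t."""
    N, P = X.shape
    return [(j, t) for j in range(P) for t in np.unique(X[:, j])]

# Hypothetical training inputs with N = 3 observations and P = 2 predictors.
X = np.array([[0.0, 1.0],
              [1.0, 2.0],
              [2.0, 3.0]])
knots = candidate_knots(X)

# Each (j, t) pair contributes two functions, (X_j - t)_+ and (t - X_j)_+.
n_basis = 2 * len(knots)
```

When all input values are distinct, this gives $2NP$ basis functions in total.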
The model-building strategy is like a stepwise linear regression, but instead of using only the original predictors $X_j$, we are allowed to use functions in $\mathcal{C}$ as well as their products. Therefore, the MARS model has the form
\[
f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X), \tag{3.8}
\]
where each $h_m(\cdot)$ is a function in $\mathcal{C}$ or a product of functions in $\mathcal{C}$. Given the $h_m$, the coefficients $\beta_m$ can be estimated via ordinary least squares (see e.g. Section B.1, page 163).
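Once the basis functions are fixed, fitting (3.8) is an ordinary linear regression on the transformed inputs. The sketch below illustrates this on synthetic, noise-free data; the knot, the coefficients and the variable names are all assumptions made for the example.

```python
import numpy as np

# Hypothetical one-dimensional inputs and a single knot at t = 1.
x = np.linspace(0.0, 2.0, 9)
t = 1.0

# Design matrix with columns h_0 = 1, h_1 = (x - t)_+ and h_2 = (t - x)_+.
H = np.column_stack([np.ones_like(x),
                     np.maximum(x - t, 0.0),
                     np.maximum(t - x, 0.0)])

# Synthetic response generated from known coefficients (2, 3, -1).
y = 2.0 + 3.0 * H[:, 1] - 1.0 * H[:, 2]

# Ordinary least squares estimate of the coefficients beta_m.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
```

Because the data are noise-free and the three columns are linearly independent, the estimate recovers the generating coefficients up to rounding.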
More precisely, the model $\mathcal{M}$ is built iteratively as follows. We start with the basic model $\mathcal{M} = \{h_0(X) = 1\}$. At each stage, we consider as a new basis function pair all products of a function $h_l$ already in $\mathcal{M}$ with one of the reflected pairs in $\mathcal{C}$, i.e. the new candidate term to add to $\mathcal{M}$ has the form
\[
\hat\beta_{M+1}\, h_l(X) \cdot (X_j - t)_+ + \hat\beta_{M+2}\, h_l(X) \cdot (t - X_j)_+, \qquad h_l \in \mathcal{M}, \tag{3.9}
\]
where $\hat\beta_{M+1}$ and $\hat\beta_{M+2}$ are fitted via least squares, along with all the other coefficients in the model. The term defined by (3.9) that produces the largest decrease in training error is then retained.$^{19}$

$^{19}$ To prevent unstable behavior, one restriction is placed on the formation of model terms: each input can appear at most once in a product, which prevents the formation of higher-order powers of an input.
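A single stage of this search can be sketched as an exhaustive loop over existing terms, predictors and knots. This is a naive illustration under stated assumptions, not an efficient or reference implementation: the names `rss`, `forward_step` and `used` are hypothetical, `used[l]` records which inputs already appear in term $h_l$ (to enforce the restriction of footnote 19), and every candidate is refitted from scratch.

```python
import numpy as np

def rss(H, y):
    """Residual sum of squares of the least squares fit of y on the columns of H."""
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    resid = y - H @ coef
    return float(resid @ resid)

def forward_step(H, used, X, y):
    """Try every product h_l * reflected pair and keep the one with the lowest RSS.

    Returns (best_rss, augmented_design_matrix, augmented_used).
    """
    best = (np.inf, None, None)
    for l in range(H.shape[1]):              # term h_l already in the model
        for j in range(X.shape[1]):          # predictor X_j
            if j in used[l]:                 # each input at most once per product
                continue
            for t in np.unique(X[:, j]):     # knot at each observed value
                pos = H[:, l] * np.maximum(X[:, j] - t, 0.0)
                neg = H[:, l] * np.maximum(t - X[:, j], 0.0)
                cand = np.column_stack([H, pos, neg])
                err = rss(cand, y)
                if err < best[0]:
                    best = (err, cand, used + [used[l] | {j}] * 2)
    return best

# Hypothetical data: a single predictor with a change of slope at x = 1.
X = np.linspace(0.0, 2.0, 9).reshape(-1, 1)
y = 2.0 + 3.0 * np.maximum(X[:, 0] - 1.0, 0.0) - np.maximum(1.0 - X[:, 0], 0.0)

H0 = np.ones((X.shape[0], 1))                # basic model h_0(X) = 1
err, H1, used1 = forward_step(H0, [set()], X, y)
```

Repeating `forward_step` on its own output until the design matrix reaches the preset maximum number of columns would give the full forward pass. Practical implementations avoid the full refit for each candidate, which this sketch does not attempt.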
The iterative process, also known as the forward pass, continues until the model $\mathcal{M}$ contains some preset maximum number of terms, say $M_{\text{ceil}}$. At the end of the forward pass, we obtain a large model of the form (3.8) that typically overfits the data. Therefore, a backward deletion procedure, or pruning pass, is applied to remove the terms that contribute least to the fit.
The pruning pass is also iterative: at each stage, the term whose removal causes the smallest increase in residual squared error is deleted, producing an estimated best model $\hat f_\lambda$ of each size $\lambda$ ($\lambda = 1, 2, \ldots, M_{\text{ceil}}$). To choose the optimal value of $\lambda$, the MARS pruning pass relies on generalized cross-validation (Craven and Wahba, 1979), which is an approximation of leave-one-out cross-validation.$^{20}$ The generalized cross-validation (GCV) criterion is defined as
\[
\mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{N} \left( y_i - \hat f_\lambda(x_i) \right)^2}{\left( 1 - M(\lambda)/N \right)^2}, \tag{3.10}
\]
where $M(\lambda)$ is the effective number of parameters in the model. If the model contains $K$ knots and $r$ linearly independent basis functions, then $M(\lambda) = r + cK$, where empirical evidence suggests $c = 3$ (or $c = 2$ if the model is restricted to be additive) (Friedman, 1991). Selecting the optimal $\lambda$ therefore amounts to choosing the value that minimizes $\mathrm{GCV}(\lambda)$.
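The GCV criterion is cheap to evaluate once the fitted values of a candidate model are available. The sketch below is a minimal illustration that uses the squared denominator of the standard GCV definition; the function name `gcv`, its argument names and the toy numbers are assumptions.

```python
import numpy as np

def gcv(y, y_hat, r, K, c=3.0):
    """GCV for a model with r linearly independent basis functions and K knots.

    The effective number of parameters is M = r + c * K, and it must be
    smaller than the number of observations N.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    N = y.shape[0]
    M = r + c * K
    if M >= N:
        raise ValueError("effective number of parameters must be below N")
    rss = float(np.sum((y - y_hat) ** 2))
    return rss / (1.0 - M / N) ** 2

# Hypothetical fitted values: N = 20, constant residual 0.5, r = 2, K = 1.
y = np.zeros(20)
y_hat = np.full(20, 0.5)
score = gcv(y, y_hat, r=2, K=1)
```

In a pruning pass, this score would be computed for the best model of each size, and the size with the smallest score would be kept.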
An advantage of piecewise linear basis functions is that they build the model parsimoniously. Such functions operate locally, since they are zero over part of their range. When they are multiplied together, their product is non-zero only over the small part of the feature space where both functions are non-zero. As a result, the regression surface is built up parsimoniously, with non-zero components only where they are needed.
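The local behavior of products can be checked numerically. The following sketch evaluates the product of two hinge functions on a small grid over the unit square; the knots and the grid are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical knots and an 11 x 11 grid on [0, 1]^2.
t1, t2 = 0.55, 0.35
g = np.linspace(0.0, 1.0, 11)
x1, x2 = np.meshgrid(g, g, indexing="ij")

# Product of two hinge functions of different predictors.
product = np.maximum(x1 - t1, 0.0) * np.maximum(t2 - x2, 0.0)

# The product is non-zero only where both factors are non-zero,
# i.e. in the corner region x1 > t1 and x2 < t2.
active = product > 0
fraction_active = active.mean()
```

Here the interaction term contributes to the fitted surface only in one corner of the input space and is exactly zero elsewhere.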
Another property of MARS worth emphasizing is that the forward pass is hierarchical, in the sense that multiway products are built up from products involving terms already included in the model. This rests on the assumption that a high-order interaction is likely to exist only if some of its lower-order components exist as well. The assumption does not always hold, but it is usually reasonable, and it avoids a search over an exponentially growing set of alternatives.
Finally, it is interesting to note the similarity between MARS and regression trees (see Section B.4 on page 171). Suppose the MARS procedure is modified as follows:
(i) the linear splines are replaced with the indicator functions $I(x - t < 0)$ and $I(x - t \ge 0)$;
(ii) when a term is multiplied by a candidate function, it is replaced by the resulting interaction, so that it is no longer available in the fitting process.
The modified MARS algorithm is then equivalent to a regression tree. Indeed, multiplying a step function by a pair of reflected step functions is equivalent to splitting a node at that step. The second modification implies that a node cannot be split more than once. By relaxing the restriction imposed by the second modification, MARS gives up the tree structure and is thereby able to capture additive effects.
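The correspondence with a tree split can be illustrated on a toy example. In the sketch below, the constant term is replaced by a reflected pair of step functions, and the least squares coefficients turn out to be the mean responses on each side of the split, as in a one-split regression tree. The data and the split point are assumptions.

```python
import numpy as np

# Hypothetical one-dimensional data and split point.
x = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([1.0, 3.0, 10.0, 14.0])
t = 0.5

# Reflected pair of step functions I(x - t < 0) and I(x - t >= 0).
left = (x - t < 0).astype(float)
right = (x - t >= 0).astype(float)

# The parent term (the constant) is replaced by its two children,
# so the design matrix holds only the two indicator columns.
H = np.column_stack([left, right])
coef, *_ = np.linalg.lstsq(H, y, rcond=None)

# Node means of a regression tree that splits at t.
node_means = np.array([y[x < t].mean(), y[x >= t].mean()])
```

Keeping the constant column in the design matrix alongside both indicators would make the columns linearly dependent, which is one way to see why the parent term is dropped in the tree-like variant.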