

We develop a three-stage estimation procedure in order to sequentially estimate $F_R(\cdot)$, $G(\cdot)$, and $f(\cdot)$, while achieving better prediction accuracy. Our estimation procedure, SILFM, is a three-stage process consisting of screening, aggregating, and nonlinear fitting, as follows:

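$$(y_i, x_i) \;\overset{\text{screening}}{\Longrightarrow}\; \tilde{x}_i \;\overset{\text{aggregating}}{\Longrightarrow}\; z_i \;\overset{\text{nonlinear fitting}}{\Longrightarrow}\; y_i = \hat{f}(z_i). \tag{4.7}$$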

See Figure 4.2 for an overview of our procedure.

Figure 4.2: Path diagram of SILFM estimation procedure.

The three stages are as follows.

• Stage (I). Use a Sure Independence Screening (SIS) procedure based on a Hilbert-Schmidt Independence Criterion (HSIC) to select a set of important features $\tilde{x}_i$.

• Stage (II). Extract the key features $z_i$ from the selected important features.

• Stage (III). Use a kernel ridge regression to build a prediction method based on the extracted key features.
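Stage (III) is described only at a high level in this section. As a rough illustration, the sketch below fits a kernel ridge regression on a matrix of extracted features. The Gaussian kernel, the names `krr_fit` and `krr_predict`, and the tuning values `ridge` and `bandwidth` are assumptions made for the example; the text does not specify them.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def krr_fit(Z_train, y_train, ridge=1e-2, bandwidth=1.0):
    """Solve (K + ridge * n * I) alpha = y for the dual coefficients alpha."""
    n = Z_train.shape[0]
    K = gaussian_kernel(Z_train, Z_train, bandwidth)
    return np.linalg.solve(K + ridge * n * np.eye(n), y_train)

def krr_predict(Z_train, alpha, Z_new, bandwidth=1.0):
    """Predict at the rows of Z_new from the fitted dual coefficients."""
    return gaussian_kernel(Z_new, Z_train, bandwidth) @ alpha

# Toy data (hypothetical): one extracted feature, nonlinear signal plus noise.
rng = np.random.default_rng(0)
Z = rng.uniform(-2.0, 2.0, size=(100, 1))
y = np.sin(2.0 * Z[:, 0]) + 0.1 * rng.standard_normal(100)

alpha = krr_fit(Z, y, ridge=1e-3, bandwidth=0.5)
y_hat = krr_predict(Z, alpha, Z, bandwidth=0.5)
train_mse = float(np.mean((y - y_hat) ** 2))
```

In practice `ridge` and `bandwidth` would be chosen by cross-validation or a similar criterion; the fixed values here only keep the example short.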

Stage (I) is a fully nonparametric, robust screening method based on HSIC. It consists of three steps:

• Step (I.1). Use HSIC and its associated p-value to measure the relationship of each feature individually to the response.

• Step (I.2). Rank the marginal HSIC values or their p-values according to their size (that is, their degree of dependence on the response).

• Step (I.3). Filter out all noisy features whose marginal HSIC value is smaller than a given threshold.

The HSIC statistic is a two-variable independence test in reproducing kernel Hilbert spaces (RKHSs) (Gretton et al. 2005). As shown in Sejdinovic et al. (2013), the HSIC statistic is consistent when a characteristic kernel is used, and it is equivalent to the distance covariance (DC) test of multivariate independence (Székely et al. 2007) when the distance-induced kernel is chosen in HSIC. Moreover, the HSIC test can be more sensitive than DC when other kernels are used, and it can be readily extended to many metric spaces. The use of HSIC is not critical in Stage (I); any other independence test, such as the fused Kolmogorov filter developed in Mai and Zou (2013), can be used instead.

We review the key ideas of HSIC for testing the independence between two random variables. Let $Z \sim P_Z$ and $Y \sim P_Y$ be random variables on $\mathcal{Z}$ and $\mathcal{Y}$, respectively, which are two nonempty topological spaces. Let $P_{Z,Y}$ be the joint probability measure of $(Z, Y)$. Let $K_Z$ and $K_Y$ be kernels on $\mathcal{Z}$ and $\mathcal{Y}$ with respective RKHSs $\mathcal{H}_{K_Z}$ and $\mathcal{H}_{K_Y}$. Then, it is well known that $K_{Z \times Y}((z, y), (z', y')) = K_Z(z, z') K_Y(y, y')$ is a kernel on the product space $\mathcal{Z} \times \mathcal{Y}$ with RKHS $\mathcal{H}_{K_{Z \times Y}}$ that is isomorphic to the tensor product $\mathcal{H}_{K_Z} \otimes \mathcal{H}_{K_Y}$. The HSIC of $Z$ and $Y$ is defined as

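$$\mathrm{HSIC}(Z, Y)^2 = \left\| \int K_{Z \times Y}(\cdot, (z, y)) \, dP_{Z,Y}(z, y) - \int\!\!\int K_{Z \times Y}(\cdot, (z, y)) \, dP_Z(z) \, dP_Y(y) \right\|^2_{\mathcal{H}_{K_{Z \times Y}}}.$$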

A fundamental result is that if $K_Z$ and $K_Y$ are universal kernels, then $\mathrm{HSIC}(Z, Y) = 0$ if and only if $P_{Z,Y} = P_Z P_Y$.

We construct an empirical estimate of HSIC. Let $H_n = I_n - n^{-1} 1_n 1_n^T$ be the centering matrix, where $I_n$ is the $n \times n$ identity matrix and $1_n = (1, \cdots, 1)^T$ is an $n \times 1$ vector with all elements equal to 1. Let $K_{Z,n}$ be an $n \times n$ matrix with $(i, j)$th element $K_Z(z_i, z_j)$, and let $K_{Y,n}$ be an $n \times n$ matrix with $(i, j)$th element $K_Y(y_i, y_j)$. Given an independently and identically distributed sample $\{(z_i, y_i)\}_{i=1}^n$, we can construct an empirical estimate of HSIC as the sum of U-statistics given by

$$\widehat{\mathrm{HSIC}}(Z, Y)^2 = n^{-2} \operatorname{tr}(K_{Z,n} H_n K_{Y,n} H_n).$$
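The trace formula above translates almost directly into matrix code. The following is a minimal sketch, assuming a Gaussian kernel with a fixed bandwidth for both variables; the kernel choice and the function names `gaussian_gram` and `hsic_squared` are illustrative and not taken from the text.

```python
import numpy as np

def gaussian_gram(v, bandwidth=1.0):
    """n x n Gaussian kernel matrix for a sample v (assumed kernel choice)."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq = np.sum(v**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * v @ v.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def hsic_squared(z, y, bandwidth=1.0):
    """Empirical HSIC^2 = n^{-2} tr(K_Z H K_Y H)."""
    n = len(z)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H_n
    K_z = gaussian_gram(z, bandwidth)
    K_y = gaussian_gram(y, bandwidth)
    return float(np.trace(K_z @ H @ K_y @ H) / n**2)

# Toy check (hypothetical data): a nonlinear dependence versus independence.
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
y_dep = z**2 + 0.1 * rng.standard_normal(n)  # depends on z, yet uncorrelated with it
y_ind = rng.standard_normal(n)               # independent of z

hsic_dep = hsic_squared(z, y_dep)
hsic_ind = hsic_squared(z, y_ind)
```

The dependent pair has essentially zero linear correlation, so this toy case illustrates why a kernel-based criterion can be useful for nonparametric screening.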

The estimate $n\widehat{\mathrm{HSIC}}(Z, Y)^2$ has some nice statistical properties, which form the theoretical foundation of the HSIC screening procedure. Statistically, as $n \to \infty$, $n\widehat{\mathrm{HSIC}}(Z, Y)^2$ converges in distribution to a weighted sum of $\chi^2(1)$ random variables (Gretton et al. 2005, Sejdinovic et al. 2013, Székely et al. 2007). Since different features may have different patterns, such as scale, we use a computationally fast approach based on a spectral method to approximate the p-value of $\widehat{\mathrm{HSIC}}$ for each feature. Specifically, for the $j$-th component of $x_i$, we calculate its HSIC and p-value. However, for computational simplicity, it is more convenient to directly use the value of the estimated HSIC to filter out 'noisy' features. In this case, for a given threshold $\gamma_n$, we can form the set of important features according to

$$\widehat{\mathcal{M}}_{\gamma_n} = \{1 \le j \le p_x : |n\widehat{\mathrm{HSIC}}(X_j, Y)^2| \ge \gamma_n\},$$

where $X_j$ and $Y$ are, respectively, the random variables for the $j$-th component of $x$ and for $y$. Theoretically, we will show that our variable screening procedure enjoys the sure independence screening property under some mild conditions. Compared to test marginal screening methods, Stage (I) aims to use a relatively small $\gamma_n$ in order to increase the chance of keeping all important features.
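The thresholding rule can be sketched as a loop over features that computes $n$ times the empirical HSIC for each one and keeps those at or above $\gamma_n$. This is an illustration under stated assumptions: a Gaussian kernel with unit bandwidth, a hypothetical helper `hsic_screen`, toy data, and an arbitrary threshold of 1.5. The text does not say how $\gamma_n$ is chosen.

```python
import numpy as np

def centered_gram(v, bandwidth=1.0):
    """H K H for a Gaussian kernel on a one-dimensional sample v."""
    v = np.asarray(v, dtype=float)
    K = np.exp(-((v[:, None] - v[None, :]) ** 2) / (2.0 * bandwidth**2))
    # Double centering: subtract column and row means, add back the grand mean.
    return K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()

def hsic_screen(X, y, gamma_n, bandwidth=1.0):
    """Return indices j with |n * HSIC^2(X_j, y)| >= gamma_n, plus all statistics."""
    n, p = X.shape
    Ky_c = centered_gram(y, bandwidth)
    stats = np.empty(p)
    for j in range(p):
        Kx_c = centered_gram(X[:, j], bandwidth)
        # tr(K_x H K_y H) equals the elementwise sum of (H K_x H) * (H K_y H);
        # dividing by n gives n * HSIC^2.
        stats[j] = np.sum(Kx_c * Ky_c) / n
    selected = np.flatnonzero(np.abs(stats) >= gamma_n)
    return selected, stats

# Toy data (hypothetical): only features 0 and 1 drive the response.
rng = np.random.default_rng(1)
n, p = 300, 20
X = rng.standard_normal((n, p))
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(n)

selected, stats = hsic_screen(X, y, gamma_n=1.5)
```

Indices are zero-based in the code, whereas the text numbers features from 1. Lowering `gamma_n` keeps more features, which matches the stated aim of using a relatively small threshold.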

Stage (II) is not only a dimension-reduction method but also an information aggregation method. Consider the true active set $\mathcal{M} = \{1, \cdots, \tilde{p}_x\}$ for the variables in $\tilde{x}_i$. Stage (II) includes three steps as follows:

• Step (II.1). Calculate the (kernel) correlation matrix of the selected features, denoted by $R_{\tilde{x}} = (r_{jk})_{1 \le j, k \le \tilde{p}_x}$.

• Step (II.2). Use the covariance thresholding method introduced by Bickel and Levina (2008) and the spectral clustering method to partition $\mathcal{M}$ into $p_{z,s}$ disjoint clusters, $\mathcal{M} = \cup_{k=1}^{p_{z,s}} \mathcal{M}_{k,s}$ with $\mathcal{M}_{k,s} \cap \mathcal{M}_{k',s} = \emptyset$ for $k \ne k'$ and $s = 1, \cdots, S$, where $\mathcal{M}_{k,s}$ is a subset of $\mathcal{M}$ and $p_{z,s}$ is an integer, which may vary across $s$. For each $s$, let $\tilde{r}_s$ be a given thresholding value and $T_{\tilde{r}_s}$ be the thresholding operator such that $T_{\tilde{r}_s}(R_{\tilde{x}}) = (r_{jk} I(|r_{jk}| \ge \tilde{r}_s))$. Let $\Pi$ be a spectral clustering function that maps each $j \in \widehat{\mathcal{M}}_{\gamma_n}$ into a unique cluster $\mathcal{M}_{k,s}$ based on $T_{\tilde{r}_s}(R_{\tilde{x}})$. That is, $\Pi(\cdot, \cdot)$ is defined as $\Pi(j, T_{\tilde{r}_s}(R_{\tilde{x}})) \in \mathcal{M}_{k,s}$ for each $j \in \widehat{\mathcal{M}}_{\gamma_n}$.

• Step (II.3). For each $\mathcal{M}_{k,s}$, we calculate the sample (kernel) covariance matrix of the features with indices in $\mathcal{M}_{k,s}$, denoted as $S_{X,k,s}$, and the eigenvalue-eigenvector pairs of $S_{X,k,s}$. Finally, we extract the key features $z_i$ based on the scores from the eigenvectors corresponding to the $r_{k,s}$ algebraically largest eigenvalues of $S_{X,k,s}$.
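The three steps of Stage (II) can be sketched for a single threshold as follows. This is a simplified illustration rather than the procedure itself: it uses the ordinary (not kernel) correlation and covariance, and it replaces the spectral clustering map $\Pi$ with connected components of the thresholded correlation graph, which coincide with the clusters only in the idealized case where the thresholded matrix is exactly block-structured. The function names and the threshold value 0.5 are hypothetical.

```python
import numpy as np

def threshold_correlation(R, r_tilde):
    """T_r(R): keep entries with |r_jk| >= r_tilde and set the rest to zero."""
    return R * (np.abs(R) >= r_tilde)

def connected_components(A):
    """Label connected components of the graph whose edges are nonzero entries of A.
    Stand-in for spectral clustering (an assumption made for this sketch)."""
    p = A.shape[0]
    labels = -np.ones(p, dtype=int)
    current = 0
    for start in range(p):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            j = stack.pop()
            for k in np.flatnonzero(A[j] != 0):
                if labels[k] < 0:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels

def cluster_pca_scores(X, labels, n_components=1):
    """Scores on the leading eigenvectors of each cluster's sample covariance."""
    scores = []
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        Xk = Xk - Xk.mean(axis=0)
        S = np.atleast_2d(np.cov(Xk, rowvar=False))
        eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues
        top = eigvecs[:, ::-1][:, :n_components]      # largest eigenvalues first
        scores.append(Xk @ top)
    return np.hstack(scores)

# Toy data (hypothetical): two groups of three features, one latent factor each.
rng = np.random.default_rng(0)
n = 200
f = rng.standard_normal((n, 2))
X = np.hstack([
    f[:, [0]] + 0.3 * rng.standard_normal((n, 3)),
    f[:, [1]] + 0.3 * rng.standard_normal((n, 3)),
])

R = np.corrcoef(X, rowvar=False)              # Step (II.1)
T = threshold_correlation(R, r_tilde=0.5)     # Step (II.2), thresholding
labels = connected_components(T)              # Step (II.2), grouping stand-in
Z = cluster_pca_scores(X, labels)             # Step (II.3)
```

Repeating the last three lines over a grid of thresholds would give one set of extracted features per threshold level.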

Stage (II) can be regarded as a novel generalization of the supervised PCA method (Bair et al. 2006), since it conducts standard PCA on marginally selected features with their indices in each cluster. A key difference is that in Step (II.2), we choose a series of $0 \le \tilde{r}_1 < \cdots < \tilde{r}_S < 1$ so that we can threshold the correlation matrix at different levels. It is expected that the larger $\tilde{r}_s$ is, the larger $p_{z,s}$ is. Equivalently, for large $\tilde{r}_s$, we only group features that are highly correlated with each other. This is similar to $\tau$-separation (Bühlmann et al. 2012), which seeks the finest grouping of features, with the number of groups determined by $\tau$, in order to deal with strongly dependent features. Compared to the hierarchical bottom-up agglomerative clustering algorithm used for estimating $\tau$-separation, the spectral clustering algorithm offers the advantage of computational expediency. By varying the threshold over this series, we extract information from the selected features at different degrees of correlation, which allows us to select the most informative projected features, those with the largest prediction power in Stage (III).

4.3 Simulation Studies

In this section, we conducted three simulation studies in order to examine the small-sample performance of SILFM and compared SILFM with other competing methods. In order to compare different methods, we examined two types of performance measures, for dimension reduction and for prediction accuracy. For each scenario, 100 simulated data sets were generated, each consisting of a training set with $n = 100$ and a test set with $n = 100$. First, for dimension reduction, we consider the true positive rate, defined as $P_{tp} = |\widehat{\mathcal{M}}_{\gamma_n} \cap \mathcal{M}| / |\mathcal{M}|$; the screening accuracy, defined as $P_A = |\widehat{\mathcal{M}}_{\gamma_n} \cap \mathcal{M}| / |\widehat{\mathcal{M}}_{\gamma_n}|$; and the true negative rate, defined as $P_{tn} = |\mathcal{M}^c \cap \widehat{\mathcal{M}}^c_{\gamma_n}| / |\mathcal{M}^c|$, where $\mathcal{M}^c$ and $\widehat{\mathcal{M}}^c_{\gamma_n}$ are, respectively, the complements of $\mathcal{M}$ and $\widehat{\mathcal{M}}_{\gamma_n}$. Second, for prediction accuracy, we computed the empirical squared prediction error of the test data set as $n^{-1} \sum_{i=1}^{n} (y_i^{test} - \hat{f}(x_i^{test}))^2$, where $\hat{f}(\cdot)$ is the prediction model built from the training set and the $(x_i^{test}, y_i^{test})$ are the observations in the test set.
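The performance measures can be computed from index sets in a few lines. The sketch below assumes zero-based feature indices and takes the true negative rate to be the share of truly inactive features that are not selected. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def screening_metrics(selected, active, p):
    """True positive rate, screening accuracy, and true negative rate.

    selected: indices kept by the screening step
    active:   indices of the truly informative features
    p:        total number of features (indices run from 0 to p - 1)
    """
    selected, active = set(selected), set(active)
    inactive = set(range(p)) - active
    not_selected = set(range(p)) - selected
    p_tp = len(selected & active) / len(active)
    p_a = len(selected & active) / len(selected) if selected else float("nan")
    p_tn = len(inactive & not_selected) / len(inactive)
    return p_tp, p_a, p_tn

def squared_prediction_error(y_test, y_pred):
    """Empirical squared prediction error: mean of (y_test - y_pred)^2."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_test - y_pred) ** 2))

# Toy inputs (hypothetical): 10 features, 4 truly active, 4 selected.
p_tp, p_a, p_tn = screening_metrics(selected=[0, 1, 2, 7], active=[0, 1, 2, 3], p=10)
spe = squared_prediction_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```

In this toy case three of the four active features are selected, one selected feature is a false positive, and five of the six inactive features are correctly left out.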

We simulated data from different $f(\cdot)$s in SILFM with both fixed and random designs for $X$ and $\varepsilon \sim N_n(0, I)$ with $n = 100$ and $p_x = 3000$. We generated $X$ from a multivariate normal distribution $N_{p_x}(0, \Sigma)$. Among the $x_i$, we set the number of informative features as $\tilde{p}_x = s = 600$ and created them from two clusters, each having $m_k = |\mathcal{M}_k| = 300$ features for $k = 1, 2$. Specifically, the correlation structure among all informative features, $\Sigma_s$, consists of two block-diagonal structures, $\Sigma_s = \mathrm{diag}(\Sigma_1, \Sigma_2)$, where we used a highly correlated structure with $\rho_w = 0.9$ as the within-cluster correlation and $\rho_b = 0.6$ as the between-cluster correlation. Also, we set the number of insignificant features as $s^c = |\mathcal{M}^c| = 2400$, and the correlation structure among all of them, $\Sigma_{s^c}$, is the identity matrix $I_{s^c}$. Therefore, the correlation structure
