
In section 3.4 we referred to the paper by Lockhart et al. [15] on significance testing of the lasso estimates. We implemented their covariance test statistic after modifying it to fit our problem. Furthermore, we applied the covariance test statistic both to the bolasso estimates and to the lasso estimates (we return to this point below). In this subsection we derive the covariance statistic for our generalized linear model and discuss how the implementation was done.

4.3.6.1 Modified Covariance Statistic

When we tried to implement the covariance statistic (3.16), we noticed that the inner product of $I^{-1}S$ with $X\hat\beta(\lambda)$ (for arbitrary $\lambda$) is not defined. For a generalized linear model, the information matrix is $I_{p\times p} = X^{T}_{p\times N} W_{N\times N} X_{N\times p}$ and the score vector $S$ has dimensions $p \times 1$, whereas $X_{N\times p}\hat\beta_{p\times 1}(\lambda)$ has dimensions $N \times 1$ [24]. Since $I^{-1}S$ is $p \times 1$ while $X\hat\beta(\lambda)$ is $N \times 1$, the two vectors cannot be paired in an inner product. We regard this as a typographical error in the paper.
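The dimension mismatch can be checked numerically. The following is a minimal sketch with NumPy, assuming hypothetical sizes N = 6 and p = 3 and random placeholder values for X, W, S and the estimate; none of these numbers come from our data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 3                                  # hypothetical sizes: N observations, p coefficients
X = rng.normal(size=(N, p))                  # placeholder design matrix
W = np.diag(rng.uniform(0.5, 1.5, size=N))   # placeholder positive diagonal weights
S = rng.normal(size=p)                       # placeholder score vector, length p
beta_hat = rng.normal(size=p)                # placeholder lasso estimate, length p

info = X.T @ W @ X                           # information matrix, p x p
step = np.linalg.solve(info, S)              # I^{-1} S, length p
eta = X @ beta_hat                           # X beta_hat, length N

print(info.shape, step.shape, eta.shape)

# The inner product of a length-p vector with a length-N vector is undefined.
try:
    np.dot(step, eta)
    mismatch = False
except ValueError:
    mismatch = True
print("inner product undefined:", mismatch)
```

The two vectors live in different spaces whenever N differs from p, which is the case in any regression setting with more observations than coefficients.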

Assuming that by $z = \eta + I^{-1}S$ Lockhart et al. [15] actually mean the equation of the working response, we develop the covariance statistic for our model. The working response for our model was given in equation (4.4). This equation can be written in matrix form as:

$$z = \eta + W(Y - \mu) \tag{4.11}$$

where $\eta = X\beta$, $Y$ is the response vector, $\mu$ is the vector of fitted values and $W$ is the diagonal matrix of weights whose $(i, i)$ element is $d\eta_i/d\mu_i$. Furthermore, $W(Y - \mu)$ has dimensions $N \times 1$, the same as $X\hat\beta(\lambda)$, which makes the inner product well defined. Therefore, the covariance statistic becomes:


$$T_k = \frac{\langle W(Y - \mu),\, X\hat\beta(\lambda_{k+1})\rangle - \langle W(Y - \mu),\, X\tilde\beta_{\Omega}(\lambda_{k+1})\rangle}{2} \tag{4.12}$$

where $W(Y - \mu)$ is computed with respect to the active set $\Omega$. We used this statistic to assess the significance of the lasso estimates.
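As an illustration of (4.12), the sketch below evaluates the statistic for given estimates. It assumes a log link (so that $d\eta_i/d\mu_i = 1/\mu_i$), a reduced estimate padded with zeros to length p so that the same X can be used in both terms, and made-up toy numbers; the function name `covariance_statistic` is hypothetical and this is not the code used in the analysis.

```python
import numpy as np

def covariance_statistic(X, y, mu, deta_dmu, beta_full, beta_reduced):
    """Evaluate (<W(Y-mu), X beta_full> - <W(Y-mu), X beta_reduced>) / 2.

    deta_dmu holds the diagonal of W, i.e. d(eta_i)/d(mu_i).
    beta_reduced is the estimate on the active set, zero-padded to length p.
    """
    r = deta_dmu * (y - mu)                  # W (Y - mu) for diagonal W
    return (r @ (X @ beta_full) - r @ (X @ beta_reduced)) / 2.0

# Toy numbers, for illustration only.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta_reduced = np.array([0.0, 0.0])          # placeholder reduced fit
beta_full = np.array([0.5, 0.25])            # placeholder lasso fit at the next knot

mu = np.exp(X @ beta_reduced)                # assumed log link: mu = exp(eta)
deta_dmu = 1.0 / mu                          # d eta / d mu under the log link

T = covariance_statistic(X, y, mu, deta_dmu, beta_full, beta_reduced)
print(T)  # 0.875
```

Here the weights and fitted values are taken from the reduced fit, following the remark that $W(Y - \mu)$ is computed with respect to the active set.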

4.3.6.2 Border Specifications

Lockhart et al. [15] do not specify the covariance statistic for the first and last coefficients to enter the model, at least for the GLM case. We nevertheless test the first and last coefficients, relying on the theory developed so far and on what seems to be a reasonable modification.

For testing the first coefficient that enters the model, one needs the active set $\Omega$ at the knot preceding $\lambda_1$. Since $\Omega$ is the active set of coefficients generated from $\lambda_{k-1}$, for $\lambda_1$ (which gives the first coefficient to enter the model) the active set right before this value is empty, that is, $\Omega = \emptyset$. Therefore, the covariance statistic becomes:

$$T_1 = \frac{\langle W(Y - \mu),\, X\hat\beta(\lambda_{2})\rangle - 0}{2} \tag{4.13}$$

where $X$ is a one-column matrix corresponding to $\Omega \cup \{\text{first to enter}\}$, and the weights are computed accordingly.

For testing the last coefficient that enters the model, we are missing an extra λ-knot. Since the covariance statistic uses the next knot $\lambda_{k+1}$ for computing $T_k$, testing the last coefficient, which corresponds to $\lambda_p$ (where $p$ is the total number of coefficients), requires one more value of $\lambda$. At that point the active set $\Omega$ has length $p - 1$ and the active set $\Omega \cup \{\text{last to enter}\}$ has length $p$. Since the final knot $\lambda_p$ is bounded below by $\lambda_{\min} = \epsilon\lambda_{\max} > 0$, we can use any $\lambda < \epsilon\lambda_{\max}$ as the next knot for computing $T_p$. Note that for $\lambda = 0$ the lasso applies no penalty and, therefore, the corresponding estimates are the usual maximum likelihood estimates. The covariance statistic for the last coefficient becomes:

$$T_p = \frac{\langle W(Y - \mu),\, X\hat\beta(\lambda)\rangle - \langle W(Y - \mu),\, X\tilde\beta_{\Omega}(\lambda)\rangle}{2} \tag{4.14}$$

where $\lambda$ is chosen by us such that $0 \le \lambda < \lambda_{\min}$.
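The two border cases amount to bookkeeping on the knot sequence: the first test uses an empty active set, and the last test uses an extra knot below $\lambda_{\min}$. The sketch below lists, for each step, the active set and the next knot that enter the statistic. The function `border_aware_steps` and its arguments are hypothetical names introduced for illustration, and the knots in the example are made up.

```python
def border_aware_steps(knots, entry_order, lambda_min, extra_lambda=0.0):
    """List (active set, entering variable, next knot) for every test T_1..T_p.

    knots        : decreasing knots lambda_1 > ... > lambda_p
    entry_order  : index of the variable entering at each knot
    lambda_min   : lower bound of the knot sequence (epsilon * lambda_max)
    extra_lambda : user-chosen knot for the last test, 0 <= extra_lambda < lambda_min
    """
    if len(knots) != len(entry_order):
        raise ValueError("one entering variable is expected per knot")
    if not 0.0 <= extra_lambda < lambda_min:
        raise ValueError("extra_lambda must satisfy 0 <= extra_lambda < lambda_min")

    # T_k uses lambda_{k+1}; the last test borrows the extra knot.
    next_knots = list(knots[1:]) + [extra_lambda]

    steps, active = [], []
    for k, (var, lam_next) in enumerate(zip(entry_order, next_knots), start=1):
        steps.append({
            "k": k,
            "omega": tuple(active),      # active set before the variable enters
            "entering": var,
            "lambda_next": lam_next,
        })
        active.append(var)
    return steps

# Made-up example with p = 3 coefficients.
steps = border_aware_steps([4.0, 2.0, 1.0], [2, 0, 1], lambda_min=0.5)
for s in steps:
    print(s)
```

With `extra_lambda=0.0` the last comparison is made against the unpenalized maximum likelihood fit, as noted above.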

4.3.6.3 Implementation of the Covariance Statistic

We implemented the covariance statistic in two ways. The first way was to apply the statistic to the output of the bolasso algorithm, that is, to the coefficients that the bolasso estimated as non-zero (after applying AIC etc.). The second way was to run the significance test on the initial datasets, that is, a simple lasso on the initial data matrices. For both methods, the sequence of λ-knots was computed by the predictor-corrector algorithm.

The reason for implementing the significance test in two different ways was to investigate how the covariance statistic works and whether the bolasso and the simple lasso give the same significant coefficients. On the one hand, the covariance statistic is a new method and the resulting p-values are not easy to interpret. On the other hand, according to Lockhart et al. [15], the usual p-values and confidence intervals do not exist for the lasso, so any confidence interval obtained by the bolasso would not be accurate. This is reasonable if we consider how intervals are obtained from a bootstrap: either by computing the standard deviation of the bootstrapped estimates of each coefficient, or by taking the 5-th and 95-th sorted values of the bootstrapped estimates for a 95% confidence interval. Neither is very accurate for the lasso, because at each bootstrap sample the algorithm chooses the optimal λ-knot via cross validation, so the resulting λ-knot is not the same across bootstrap samples. Applying a different penalty to the same estimate throughout the bootstrap process will therefore not result in an accurate confidence interval. Furthermore, if we do compute confidence intervals from the bootstrap samples, the usual standard deviation method could be worse than simply taking the 5-th and 95-th values of the sorted sequence: since the lasso is a greedy way of estimating parameters, any normality assumptions needed for the standard deviation method might be violated.
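The difference between the two bootstrap interval constructions can be seen on a small synthetic example. The sketch below assumes made-up bootstrap estimates of a single coefficient, many of which are exactly zero as is typical for the lasso. The function name `bootstrap_intervals` is hypothetical, and `z = 1.645` is the standard normal quantile that matches the 5-th and 95-th percentiles.

```python
import numpy as np

def bootstrap_intervals(estimates, lo_pct=5.0, hi_pct=95.0, z=1.645):
    """Return (standard-deviation interval, percentile interval) for one coefficient."""
    est = np.asarray(estimates, dtype=float)
    sd = est.std(ddof=1)                                  # sample standard deviation
    sd_interval = (est.mean() - z * sd, est.mean() + z * sd)
    lo, hi = np.percentile(est, [lo_pct, hi_pct])         # sorted-value interval
    return sd_interval, (float(lo), float(hi))

# Synthetic bootstrap output: 40 replicates shrunk exactly to zero,
# 60 replicates spread over positive values. Not real data.
estimates = np.concatenate([np.zeros(40), np.linspace(0.1, 1.0, 60)])

sd_interval, pct_interval = bootstrap_intervals(estimates)
print("standard deviation method:", sd_interval)
print("percentile method:        ", pct_interval)
```

In this example the standard deviation interval is symmetric around the mean and extends below zero, although no bootstrap estimate is negative, whereas the percentile interval stays within the range of the observed estimates.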

We focus more on the output of the covariance test than on the actual interpretation of the p-values. For the bolasso estimates, we expect that all the estimates chosen to be non-zero will get a rather low p-value from the covariance test, meaning that they are significant and correctly chosen by the bolasso. For the estimates from the complete dataset, we expect that the covariance statistic will identify as significant the same coefficients that the bolasso chose to be non-zero.

Chapter 5

Results of the Analysis

In this chapter we present the results of the analysis. The main analysis was based on the NRD dataset, but the results from the RD dataset are also given for comparison. The estimation methods and the significance assessment were the same for both datasets.

Both datasets were treated in two different ways. The first was the bolasso estimation method, in which the bootstrap was used for estimating the coefficients of the model. For each bootstrap sample, the optimal λ-knot was found via cross validation and the modified predictor-corrector algorithm discussed in subsection 4.3.3.2. After the bootstrap, the optimal threshold was found using Akaike's information criterion. The estimates and their confidence intervals were computed from the bootstrap samples. Furthermore, the estimates that the bolasso chose as non-zero were used for significance testing.
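The thresholding step of the bolasso can be sketched as follows. The sketch assumes that the bootstrap has already produced a matrix of coefficient estimates (one row per bootstrap sample) and takes the threshold as a given number; the AIC-based choice of the threshold is not reproduced here. The function names and the numbers are hypothetical.

```python
import numpy as np

def selection_frequency(boot_coefs):
    """Fraction of bootstrap samples in which each coefficient is non-zero."""
    return (np.asarray(boot_coefs) != 0.0).mean(axis=0)

def bolasso_support(boot_coefs, threshold):
    """Indices of coefficients selected in at least `threshold` of the samples."""
    freq = selection_frequency(boot_coefs)
    return np.flatnonzero(freq >= threshold), freq

# Made-up estimates: 4 bootstrap samples (rows), 3 coefficients (columns).
boot_coefs = np.array([
    [0.8, 0.0, 0.1],
    [0.7, 0.0, 0.0],
    [0.9, 0.2, 0.0],
    [0.6, 0.0, 0.0],
])

support, freq = bolasso_support(boot_coefs, threshold=0.75)
print("selection frequencies:", freq)
print("selected coefficients:", support)
```

Only the coefficients in the returned support would then be passed on to the significance test.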

The other method is a simple lasso application. For this method, the complete matrices from both datasets were used without any bootstrap process, and the complete predictor-corrector algorithm was used for finding the λ-knots. The significance of the coefficients was then assessed by the covariance statistic, this time using all the coefficients.

This chapter begins with section 5.1, where the results from the bolasso are given. It continues with section 5.2, where the results from the covariance statistic are presented. Section 5.3 discusses the results for the estimated log risk ratios. Section 5.4 gives the results from the replication of the bolasso for the NRD dataset. The asymptotic distribution of the covariance statistic is addressed in section 5.5. Finally, section 5.6 discusses the general results of the analysis.


5.1 Bolasso Results

In this section we present and discuss the bolasso results from both datasets. The section focuses on comparing the results between the two datasets and on investigating the characteristics of the bolasso method that differ significantly between them.
