
This subsection concerns the significance test for the lasso estimates proposed by Lockhart et al. [15]. We consider only the case of generalized linear models, although the authors have not yet formally proven the validity of the test in that setting. The covariance statistic tests the significance of a coefficient that enters the model along a sequence of nested lasso models. The predictor-corrector algorithm is therefore essential for finding the correct λ-knots, which give nested models.

The null hypothesis being tested is that "the current active coefficients in the lasso model are the ones which should be in the model". The covariance statistic tests the significance of a coefficient that is about to enter the model, given the ones that are already in the active set at that point. In other words, for the j-th predictor to enter the model, we test H0: "βj = 0 given the other coefficients in the model" (note that the hypothesis is conditional; we come back to this later in this section). The interpretation, however, is not so straightforward [15].

The lasso method estimates the coefficients in an adaptive and greedy way. For that reason, the drop in the residual sum of squares under the null hypothesis is stochastically much larger than a $\chi^2_1$ variable, and thus the usual chi-squared test is no longer applicable. Lockhart et al. [15], however, state that the covariance test statistic accounts for the adaptive sequence of the lasso estimates and that it balances the shrinkage against the adaptivity of the lasso. They have shown that, under the null hypothesis, the covariance statistic asymptotically follows the Exp(1) distribution for the linear model. Based on simulation results, they argue that this result also holds for generalized linear models.

We consider estimates from equation (3.10), where all the requirements about the exponentiality of the likelihood function, discussed in section 3.1, also apply. Let Ω be the active set of parameters just before the λk-knot takes effect (that is, the non-zero coefficients at the λk−1-knot), and let j be the predictor that enters the model when estimation is done at the λk-knot. That is, when we estimate at the λk-knot, the active set becomes Ω ∪ {j}. We wish to test the significance of the j-th predictor. Lockhart et al. [15] define the covariance statistic for a generalized linear model under the lasso penalty to be:

$$T_k = \frac{\langle I^{-1/2} S,\; X\hat{\beta}(\lambda_{k+1})\rangle - \langle I^{-1/2} S,\; X_{\Omega}\tilde{\beta}_{\Omega}(\lambda_{k+1})\rangle}{2} \tag{3.16}$$

where I = ∇²ℓ(β) and S = ∇ℓ(β) are calculated on the active set Ω. These quantities come from the weighted procedure z = η + I⁻¹S of a generalized linear model (although previously we used another formula for the weights). Furthermore, λk+1 is the value of the next knot, at which the active set changes again (becomes Ω ∪ {j} ∪ {j + 1}), β̂(λk+1) are the estimated coefficients penalized by λk+1 under the active set Ω ∪ {j}, and β̃Ω(λk+1) are the estimated coefficients penalized by λk+1 under the active set Ω. Finally, the symbols ⟨·, ·⟩ indicate the inner product of the two vectors [15].
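As an illustration of how (3.16) could be evaluated numerically, the following is a minimal sketch, not the implementation used in this work. It assumes that I is diagonal on the scale of the linear predictor η (as in the usual weighted procedure), so that I^{-1/2}S reduces to an elementwise division. The function name and all argument names are hypothetical, and the two lasso fits at λk+1 are assumed to have been computed elsewhere.

```python
import numpy as np

def covariance_statistic(X, score, info_diag, beta_full, beta_reduced, active):
    """Sketch of T_k in (3.16), assuming a diagonal information matrix I.

    X            : (n, p) design matrix
    score        : length-n vector S (gradient w.r.t. the linear predictor)
    info_diag    : length-n diagonal of I (assumed positive)
    beta_full    : length-p lasso solution at lambda_{k+1} (full problem)
    beta_reduced : lasso solution at lambda_{k+1} restricted to `active`
    active       : column indices of the set Omega
    """
    X = np.asarray(X, dtype=float)
    # I^{-1/2} S, elementwise because I is assumed diagonal
    u = np.asarray(score, dtype=float) / np.sqrt(np.asarray(info_diag, dtype=float))
    fit_full = X @ np.asarray(beta_full, dtype=float)
    fit_reduced = X[:, active] @ np.asarray(beta_reduced, dtype=float)
    # Half the difference of the two inner products
    return (u @ fit_full - u @ fit_reduced) / 2.0

# Toy numbers chosen only to make the arithmetic easy to follow
X_toy = np.eye(2)
t_toy = covariance_statistic(X_toy, score=[1.0, 2.0], info_diag=[1.0, 4.0],
                             beta_full=[2.0, 3.0], beta_reduced=[1.0], active=[0])
# If the full and reduced fits coincide, the statistic is zero
t_zero = covariance_statistic(X_toy, score=[1.0, 2.0], info_diag=[1.0, 4.0],
                              beta_full=[1.0, 0.0], beta_reduced=[1.0], active=[0])
```

In the toy call, I^{-1/2}S = (1, 1), the two inner products are 5 and 1, and the statistic is (5 − 1)/2 = 2.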

Note that, for testing the significance of the j-th predictor (which enters the model at the λk-knot), we use the next knot, λk+1. This is done because, had we computed Tk at the λk-knot, we would have obtained Tk = 0: β̂Ω(λk) = β̃Ω(λk), since the solution of the full problem at the λk-knot restricted to the set Ω (β̂(λk) under the active set Ω and not Ω ∪ {j}) is the same as the solution of the reduced problem (β̃Ω(λk) under Ω). Therefore Xβ̂(λk) = XΩβ̂Ω(λk) = XΩβ̃Ω(λk). Moreover, the new predictor will have gained its full power at the λk+1-knot [15].

For each covariance statistic Tk we can compute the corresponding p-value. Large values of Tk mean that the coefficient being tested has a large impact on the model. This results in a small p-value, which means that the coefficient is significant and the null hypothesis is rejected. In simpler words, the set Ω (the set of active coefficients without the one being tested) does not contain all the truly active coefficients, and therefore the new one has to enter the set. Finally, Lockhart et al. [15] show that the degrees of freedom of a model with k predictors are simply k under the lasso. Therefore the degrees of freedom for the Tk statistic are (k + 1) − k = 1.
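Under the asymptotic Exp(1) null distribution, the p-value of an observed statistic is the upper tail probability exp(−Tk). The short sketch below illustrates this step only; the function name is hypothetical, and the reference distribution is the asymptotic one, so the resulting p-values are approximate.

```python
import math

def covariance_test_pvalue(t_k):
    """Upper-tail probability of Exp(1) at t_k, i.e. P(E > t_k) = exp(-t_k).

    Values of t_k at or below zero give a p-value of 1.
    """
    return min(1.0, math.exp(-t_k))

p_small_stat = covariance_test_pvalue(0.5)   # weak evidence against H0
p_large_stat = covariance_test_pvalue(3.0)   # stronger evidence against H0
```

A larger statistic gives a smaller p-value; for example, Tk = 3 falls just below the conventional 0.05 level.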

The paper of Lockhart et al. [15] has been followed by several discussions of lasso inference and of whether the covariance test statistic can be trusted; see Bühlmann et al. [23], Cai and Yuan [5], Fan and Ke [9] and Lv and Zheng [17]. Those articles mainly concern how the p-values of the covariance test arise and how they should be interpreted. For example, Bühlmann et al. [23] argue that, because the p-values from the covariance statistic are based on a conditional test (given the other active coefficients in the model), they cannot be interpreted in the same way as the usual p-values. Nevertheless, all of those discussion papers congratulate the authors of "A significance test for the lasso". In our case, since there are no other publications on the significance of the lasso estimates, we rely on the paper by Lockhart et al. [15] for assessing the significance of our estimates.

All the methods presented in this chapter will be combined with the case-crossover design for our problem. The next chapter adapts the methods discussed here to our model. Some algorithmic modifications, applied mainly because of running-time problems, will also be presented.

Chapter 4

Adaptation to the Theory

The purpose of this chapter is to adapt the theory discussed so far to our problem. Section 4.1 discusses how the datasets were modified. Sections 4.2 and 4.3 then present an adaptation of the theory given in chapters 2 and 3.

4.1 Drug Frequencies and Information Flow

In this section, we give an overview of how the two datasets evolved during the stages of the analysis. Figure 4.1 gives the initial frequencies of the 775 drugs. Drugs with fewer than 100 total intakes (below the red line) were excluded from the analysis because they carried too little information relative to the total of 75,000 patients. Those drugs would probably not give any important results, but they would significantly increase the running time of the algorithms, which was a major issue even for the remaining drugs.¹ The upper right corner of the figure shows a zoomed version, so that the red line can be seen more easily.
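The exclusion rule can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function name, the dictionary layout and the drug labels are hypothetical, and a drug with exactly 100 intakes is assumed to be kept, since only drugs with fewer than 100 are excluded.

```python
def drugs_to_keep(total_intakes, threshold=100):
    """Return the drugs whose total number of intakes reaches the threshold.

    total_intakes : dict mapping a drug label to its total intake count
    threshold     : drugs with fewer intakes than this are excluded
    """
    return {drug for drug, count in total_intakes.items() if count >= threshold}

# Hypothetical counts, for illustration only
example_intakes = {"drug_a": 1520, "drug_b": 99, "drug_c": 100, "drug_d": 3}
kept = drugs_to_keep(example_intakes)
```

Here only drug_a and drug_c survive the filter.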

Figure 4.2 shows how the total number of patients was reduced during the analysis. The reduction occurred because not all patients could contribute information to the analysis. After defining the windows of our analysis, which will be discussed in subsection 4.2.1.1, some patients had the same exposure frequencies in both the case and control windows. Thus, according to subsection 2.1.1, they could not contribute anything to the analysis, so they were removed.
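A minimal sketch of this contribution check is given below. It assumes that the exposures are stored as two arrays with one row per patient and one column per drug, one array for the case window and one for the control window; this layout and the function name are assumptions made for illustration.

```python
import numpy as np

def informative_patients(case_exposure, control_exposure):
    """Boolean mask of patients whose exposures differ between the two windows.

    Both inputs are (patients x drugs) arrays. A patient with identical rows
    in the case and control windows contributes nothing and is flagged False.
    """
    case = np.asarray(case_exposure)
    control = np.asarray(control_exposure)
    return (case != control).any(axis=1)

# Three hypothetical patients, two hypothetical drugs
case_window = np.array([[1, 0], [0, 1], [1, 1]])
control_window = np.array([[1, 0], [1, 1], [1, 0]])
mask_all_drugs = informative_patients(case_window, control_window)

# Repeating the check after dropping the first drug column
mask_after_drop = informative_patients(case_window[:, 1:], control_window[:, 1:])
```

The second call shows why the check has to be repeated whenever drugs are removed: the second patient differs between windows only in the dropped drug and therefore becomes uninformative.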

¹ See appendix A for the challenges of Big Data.


Figure 4.1: The initial frequencies multiplied by 75,000, the initial total number of patients in both datasets. Drugs below the red line were excluded from the analysis. The red line is placed at y = 100. Note that, because the two datasets were generated separately and the intakes depend on random number generation, the drugs included in the analysis are not the same for both datasets. This can also be seen in stage 3 of figure 4.2. The line, however, is approximately the same for both datasets.

Figure 4.2: The changes in the two datasets throughout the analysis. Stage 1 is the initial stage, where the two datasets were generated, and it represents the two complete datasets. Stage 2 is where the contribution check took place: only the patients that could contribute to the analysis were retained, and the rest were excluded. Stage 3 is the reduction stage, where only drugs with total intakes of more than 100 were retained. Note that stage 3 also shows a reduction in the total number of patients. This happened because, after some drugs were removed, some patients ended up with the same exposures in both windows and therefore had to be excluded as well.
