12. ECONOMIC EVALUATION OF THE PROJECT
12.2. Personnel Cost
This subsection concerns the significance test for the lasso estimates proposed by Lockhart et al. [15]. We will, however, consider only the case of generalized linear models, although the authors have not yet proven that the test remains valid in that setting. The covariance statistic tests the significance of a coefficient that enters the model in a sequence of nested lasso models. The predictor-corrector algorithm is therefore essential for finding the correct $\lambda$-knots, which yield nested models.
The null hypothesis being tested here is that "the current active coefficients in the lasso model are the ones which should be in the model". The covariance statistic tests the significance of a coefficient that is about to enter the model, given the ones that are already in the active set at that point. In other words, for the $j$-th predictor to enter the model, we test $H_0$: "$\beta_j = 0$ given the other coefficients in the model" (note that the hypothesis is conditional; we come back to this later in this section). The interpretation, however, is not entirely straightforward [15].
The lasso method estimates the coefficients in an adaptive and greedy way. For that reason, the drop in the residual sum of squares under the null hypothesis is stochastically much larger than a $\chi^2_1$ variable, and thus the usual chi-squared test is no longer applicable. Lockhart et al. [15], however, state that the covariance test statistic accounts for the adaptive sequence of the lasso estimates and that it balances the shrinkage against the adaptivity of the lasso. They have shown that, under the null hypothesis, the covariance statistic asymptotically follows the Exp(1) distribution for any linear model. Based on simulation results, they argue that this result also holds for generalized linear models.
We consider estimates from equation (3.10), where all the requirements on the exponential-family form of the likelihood function, discussed in section 3.1, also apply. Let $\Omega$ be the active set of parameters just before the effect of the $\lambda_k$-knot (that is, the non-zero coefficients from the $\lambda_{k-1}$-knot), and let $j$ be the index of the predictor that enters the model when estimation is done at the $\lambda_k$-knot. That is, when we estimate at the $\lambda_k$-knot, the active set becomes $\Omega \cup \{j\}$. We wish to test the significance of the $j$-th predictor. Lockhart et al. [15] define the covariance statistic for a generalized linear model under the lasso penalty as:
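\[
T_k = \frac{\langle I^{-1/2} S,\; X\hat{\beta}(\lambda_{k+1}) \rangle - \langle I^{-1/2} S,\; X_{\Omega}\tilde{\beta}_{\Omega}(\lambda_{k+1}) \rangle}{2} \tag{3.16}
\]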
where $I = \nabla^2 \ell(\beta)$ and $S = \nabla \ell(\beta)$ are calculated on the active set $\Omega$. These quantities come from the weighted procedure $z = \eta + I^{-1} S$ of a generalized linear model (although previously we used another formula for the weights). Furthermore, $\lambda_{k+1}$ is the value of the next knot, at which the active set changes again (becoming $\Omega \cup \{j\} \cup \{j+1\}$); $\hat{\beta}(\lambda_{k+1})$ are the coefficients estimated with penalty $\lambda_{k+1}$ under the active set $\Omega \cup \{j\}$; and $\tilde{\beta}_{\Omega}(\lambda_{k+1})$ are the coefficients estimated with penalty $\lambda_{k+1}$ under the active set $\Omega$. Finally, $\langle \cdot, \cdot \rangle$ denotes the inner product of its two arguments [15].
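As a rough illustration of equation (3.16), the following Python sketch computes $T_k$ from quantities that are assumed to be already available. It assumes, for illustration only, that $I$ is diagonal with positive entries, so that $I^{-1/2} S$ can be formed elementwise, and that the two fitted linear predictors $X\hat{\beta}(\lambda_{k+1})$ and $X_{\Omega}\tilde{\beta}_{\Omega}(\lambda_{k+1})$ are passed in as plain lists. The function name, the argument names and the toy numbers are hypothetical and are not taken from [15].

```python
import math


def covariance_statistic(info_diag, score, fit_full, fit_reduced):
    """Sketch of T_k = (<I^{-1/2} S, fit_full> - <I^{-1/2} S, fit_reduced>) / 2.

    Assumption (not from the source): I is diagonal with positive entries,
    given here by its diagonal `info_diag`, so I^{-1/2} S is elementwise.
    """
    # Scale each score component by the inverse square root of its information.
    scaled = [s / math.sqrt(i) for s, i in zip(score, info_diag)]
    # Inner products with the fit under Omega ∪ {j} and the fit under Omega.
    inner_full = sum(u * f for u, f in zip(scaled, fit_full))
    inner_reduced = sum(u * f for u, f in zip(scaled, fit_reduced))
    return (inner_full - inner_reduced) / 2.0


# Toy inputs, chosen only so that the arithmetic is easy to follow.
info_diag = [0.25, 0.25, 0.25, 0.25]
score = [0.5, -0.5, 0.5, -0.5]
fit_full = [1.0, -1.0, 0.5, -0.5]          # stands in for X beta_hat(lambda_{k+1})
fit_reduced = [0.25, -0.25, 0.25, -0.25]   # stands in for X_Omega beta_tilde_Omega(lambda_{k+1})

t_k = covariance_statistic(info_diag, score, fit_full, fit_reduced)
print(t_k)  # 1.0
```

When the two fitted vectors coincide, the two inner products cancel and the sketch returns 0.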
Note that, for testing the significance of the $j$-th predictor (inserted in the model at the $\lambda_k$-knot), we use the next knot, $\lambda_{k+1}$. This is because, had we computed $T_k$ at the $\lambda_k$-knot, we would have obtained $T_k = 0$: there $\hat{\beta}_{\Omega}(\lambda_k) = \tilde{\beta}_{\Omega}(\lambda_k)$, since the solution of the full problem at the $\lambda_k$-knot restricted to the set $\Omega$ ($\hat{\beta}(\lambda_k)$ under the active set $\Omega$ and not $\Omega \cup \{j\}$) is the same as the solution of the reduced problem ($\tilde{\beta}_{\Omega}(\lambda_k)$ under $\Omega$). Therefore $X\hat{\beta}(\lambda_k) = X_{\Omega}\hat{\beta}_{\Omega}(\lambda_k) = X_{\Omega}\tilde{\beta}_{\Omega}(\lambda_k)$. Moreover, the new predictor will have gained its full power by the $\lambda_{k+1}$-knot [15].
For each covariance statistic $T_k$ we can compute the corresponding p-value. Large values of $T_k$ mean that the current coefficient has a large impact on the model. This results in a small p-value, which means that the current coefficient is significant and the null hypothesis is rejected. In simpler words, the set $\Omega$ (the set of active coefficients without the one being tested) does not contain all the truly active coefficients, and therefore the new one has to enter the set. Finally, Lockhart et al. [15] show that the degrees of freedom for a model with $k$ predictors are simply $k$ under the lasso model. Therefore the degrees of freedom for the $T_k$ statistic are $(k + 1) - k = 1$.
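Because $T_k$ is asymptotically Exp(1) under the null hypothesis, its p-value is the upper tail probability $P(\mathrm{Exp}(1) > T_k) = e^{-T_k}$. A minimal Python sketch of this step follows; the function name is hypothetical, and capping the result at 1 for a negative input is an illustrative choice rather than something prescribed in [15].

```python
import math


def covariance_test_p_value(t_k):
    """Upper tail probability of an Exp(1) variable at t_k, i.e. exp(-t_k)."""
    # Cap at 1 so that a (numerically) negative statistic still gives a valid p-value.
    return min(1.0, math.exp(-t_k))


# A statistic of 0 gives no evidence against the null; larger values give smaller p-values.
p_zero = covariance_test_p_value(0.0)
p_three = covariance_test_p_value(3.0)
print(p_zero, p_three < 0.05)  # 1.0 True
```

Under this approximation, a statistic of about 3 already falls just below the conventional 5% level.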
The paper of Lockhart et al. [15] has prompted discussion about lasso inference and whether the covariance test statistic can be trusted; see Bühlmann et al. [23], Cai and Yuan [5], Fan and Ke [9] and Lv and Zheng [17]. Those articles mainly concern how the p-values of the covariance test arise and how they should be interpreted. For example, Bühlmann et al. [23] argue against the usual interpretation, stating that the p-values from the covariance statistic are based on a conditional test (given the other active coefficients in the model) and therefore cannot be interpreted in the same way as ordinary p-values. Nevertheless, all those discussion papers congratulate the authors of "A significance test for the lasso". In our case, since there are no other publications on the significance of the lasso estimates, we rely on the paper by Lockhart et al. [15] to assess the significance of our estimates.
All the methods presented in this chapter will be combined with the case-crossover design for our problem. The next chapter adapts the methods discussed here to our model. It also presents some algorithmic modifications, which were applied mainly because of running-time problems.
Chapter 4
Adaptation to the Theory
The purpose of this chapter is to adapt the theory discussed so far to our problem. Section 4.1 discusses how the datasets were modified. Sections 4.2 and 4.3 then adapt the theory given in chapters 2 and 3.
4.1 Drug Frequencies and Information Flow
In this section, we give an overview of how the two datasets evolved during the stages of the analysis. Figure 4.1 gives the initial frequencies of the 775 drugs. Drugs with fewer than 100 total intakes (below the red line) were excluded from the analysis because they carried too little information relative to the total of 75,000 patients. Those drugs would probably not give any important results, but including them could significantly increase the running time of the algorithms, which was a major issue even for the remaining drugs.¹ In the upper right corner of the figure, a zoomed-in version is shown so that the red line can be seen more easily.
Figure 4.2 shows how the total number of patients was reduced during the analysis. The reduction occurred because not all patients could contribute information to the analysis. After defining the windows of our analysis, which will be discussed in subsection 4.2.1.1, some patients had the same exposure frequencies in both the case and control windows. Thus, according to subsection 2.1.1, they could not contribute anything to the analysis, so they were removed.
¹ See appendix A for the challenges of Big Data.
Figure 4.1: Initial drug frequencies multiplied by 75,000, the initial total number of patients in both datasets. Drugs below the red line, which is placed at y = 100, were excluded from the analysis. Because the two datasets were generated separately and the intakes depend on random number generation, the total number of drugs included in the analysis is not the same for both datasets; this can also be seen in stage 3 of figure 4.2. The line is, however, approximately the same for both datasets.
Figure 4.2: Changes in the two datasets throughout the analysis. Stage 1 is the initial stage, in which the two datasets were generated; it represents the two complete datasets. Stage 2 is the contribution check: only the patients who could contribute to the analysis were kept, and the rest were excluded. Stage 3 is the reduction stage, in which only drugs with total intakes of more than 100 were kept. Note that stage 3 also reduces the total number of patients: removing some drugs left some patients with the same exposures in both windows, so they had to be excluded as well.
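The three stages can be sketched in Python as follows. This is only an illustration under assumptions that the text does not confirm: each patient is represented as a dictionary with per-drug exposure counts for a case and a control window, a drug's total intakes are taken to be the sum of those counts over the complete dataset, and drugs with fewer than `min_intakes` total intakes are dropped. All names and the toy data are hypothetical.

```python
def differs(patient):
    """True if case and control exposures differ for at least one drug."""
    case, control = patient["case"], patient["control"]
    drugs = set(case) | set(control)
    # A drug missing from a window is treated as zero exposure.
    return any(case.get(d, 0) != control.get(d, 0) for d in drugs)


def total_intakes(patients):
    """Sum the exposure counts of every drug over all patients and both windows."""
    totals = {}
    for p in patients:
        for window in ("case", "control"):
            for drug, count in p[window].items():
                totals[drug] = totals.get(drug, 0) + count
    return totals


def reduce_dataset(patients, min_intakes=100):
    """Return (stage 2 patients, stage 3 patients, kept drugs)."""
    # Drug totals are computed on the complete (stage 1) dataset: an assumption.
    kept = {d for d, t in total_intakes(patients).items() if t >= min_intakes}
    # Stage 2: contribution check.
    stage2 = [p for p in patients if differs(p)]
    # Stage 3: drop rare drugs, then repeat the contribution check.
    restricted = [
        {
            "id": p["id"],
            "case": {d: c for d, c in p["case"].items() if d in kept},
            "control": {d: c for d, c in p["control"].items() if d in kept},
        }
        for p in stage2
    ]
    stage3 = [p for p in restricted if differs(p)]
    return stage2, stage3, kept


# Toy data with a small threshold so that the effect of each stage is visible.
patients = [
    {"id": "A", "case": {"d1": 1, "d2": 0}, "control": {"d1": 0, "d2": 0}},
    {"id": "B", "case": {"d1": 1}, "control": {"d1": 1}},
    {"id": "C", "case": {"d2": 1}, "control": {}},
    {"id": "D", "case": {"d1": 2, "d2": 0}, "control": {"d1": 1}},
]
stage2, stage3, kept = reduce_dataset(patients, min_intakes=3)
print([p["id"] for p in stage2], [p["id"] for p in stage3], kept)
```

In the toy data, patient B is removed at stage 2, and patient C is removed at stage 3 because the only drug on which C's windows differ is dropped as too rare.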