Fan and Li [17] proposed a unified approach via nonconcave penalized likelihood to simultaneously select and estimate the coefficients without the inherent shortcoming of biasedness in Lasso while still retaining the good features in Lasso. They introduced the Smoothly Clipped Absolute Deviation (SCAD) penalty function.
pλ(|β|) = λ|β|, if 0 ≤ |β| < λ
−|β|2−2aλ|β|+λ2(a−1) 2, if λ ≤ |β| < aλ
(a+1)λ2
2 , if |β| ≥ aλ
This penalty function is constructed in such a way that it retains the good property of Lasso in sparsity with a L1 penalty for small parameter, a constant penalty for large
parameter to overcome the issue of bias and uses a quadratic spline at two knots λ and aλ, where a > 2, to generate a continuous differentiable penalty function.
Although SCAD penalty function produces estimator that has all 3 properties, it is nonconvex. It is therefore more difficult than Lasso in terms of computation. Nonethe- less, the good statistical properties of SCAD motivated many algorithms to be developed and thus SCAD is also commonly used in model selection.
−5 0 5 0 1 2 3 4 5 6 β p ( β )
SCAD and MCP penalty function
SCAD MCP −10 −5 0 5 10 −10 −5 0 5 10 β p ( β )
Penalised least square estimators
SCAD MCP
Figure 2.4: SCAD, a = 3.7 and MCP, γ = 3.7 penalty functions (left panel) and PLS estimators (right panel)
asedness. It provides sparse convexity to the broadest extent by minimizing the maximum concavity. Following the conditions it needs to possess such properties, its penalty func- tion can be expressed as λR|β|
0 (1 − x
γλ)+dx which can be expressed as
pλ(|β|) = λ(|β| − |β|2λγ2), |β| < λγ λ2γ 2 , |β| ≥ λγ
Visually, MCP is a refinement of SCAD where it uses a single knot rather than two to achieve the desired properties. A larger value of the tuning parameter γ affords less unbiasedness and more concavity. As such, MCP is “simpler” than SCAD in a way and in fact any similar penalty function within the region between the MCP and the SCAD, as shown in Figure 2.4, will have the desired properties. Nevertheless, the computation and analytical difficulties of such nonconvex minimization remain a concern.
There are penalty functions that are Lasso-like which make use of combinations of Lasso-type of penalty functions to address specific issues. An example is the Elastic Net proposed by Zou and Hasite [60], which is a combination of Lasso and Ridge penalty to address the issue of grouping of variables. It can be perceived as a two stage procedure which facilitate the selection of highly correlated predictors as a group - either all in together or all out together. There are also multiple-stage methods which make good use of each of the characteristics of the different Lasso-type of penalty functions to perform initial screening and final selection. One example is the two stage procedure proposed by Zhao and Chen [56] which makes use of Lasso’s efficiency in achieving sparsity to perform initial screening and utilizes the SCAD to perform finer selection. Other penalty func- tions are a convolution of others. Particularly, SCAD has a lasso for small parameters to achieve sparsity and a constant penalty for large parameter to maintain unbiasedness.
In summary, penalty function determines the behaviour of the estimator. With good understanding of the characteristics and properties of a basis of penalty functions, one can generate a wide spectrum of different penalty functions via composition or convolutions to meet different needs. It is with this understanding that we formulate our penalty function to overcome the prevalent problem of separation in logistic regression.
Separation and Existing Techniques
Logistic regression is probably the most common statistical analysis after linear regres- sion. Its widespread use could be attributed to the high occurrences of binary or cat- egorical responses. In fact, on many occasions, continuous responses are dichotomized and analyzed via logistic regression, making it pervasive in statistical analysis.
In this chapter, we will discuss separation, a common pitfall in logistic regression analysis. Particularly, we will highlight how separation arises and the consequences of it. We will also list some existing methods that help to ameliorate the issue of separation in data and emphasize the importance of resolving it.
3.1
Separation
According to Albert and Anderson [3], the configuration of data involving categorical response variable can be categorized into 3 mutually exclusive categories:
(a) Separation,
(b) Quasi-separation and
(c) Overlap
In particular, for the archetypical logistic regression model for a binary dependent variable, separation occurs when there exists a subvector of the covariates by which all subjects can be correctly classified in terms of their responses of either 0 or 1. This is equivalent to the existence of a hyperplane passing through the space of the covariates such that on one side of the hyperplane are observations with 0s’ while on the other side are observations with 1s’ as interpreted by Agresti [1]. Pictorially, it can be illustrated [46] as follows: ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● x1 x2 Separation ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● x1 x2 Quasi−separation ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● x1 x2 Overlap
Figure 3.1: Configuration of data involving multinomial dependent response
That is, separation occurs when the categorical response can be perfectly separated by a single variable or a non-trivial linear combination of the variables (separating vari- ables) [36]. Quasi-separation occurs when the response variable are “almost” perfectly separated, with some responses lying on the line of separation generated from the sepa- rating variables.
In both separation and quasi-separation, the maximum likelihood estimates of the parameters associated with the separating variables are infinite. This is the main issue
population are finite [27]. This thorny problem, on many occasions, has often forced researchers to embark on difficult and consequential decisions due to ignorance or lack of skills to tackle it. In addition, it is worth pointing out that the actual estimates of parameter of the separating variables in most standard statistical software packages are dependent on the arbitrary choice of the convergence criteria for the estimation routine in the algorithm of the package. This, in the hand of unsuspecting researchers, can be pernicious and is undesirable.
Besides being pervasive in situations where data is sparse and strong relationships are present, separation can also be found in inconspicuous situations. In fact, Heinze and Schemper [27] showed that separation can actually be quite prevalent under a host of other conditions. In particular, they showed that separation can occur even when the underlying model parameters are low in absolute value. They also showed that the probability of separation is dependent on the number of dichotomous covariates, the magnitude of the odds ratios and the degree of balance in their distribution.
Furthermore, addressing the issue of separation in data is implicitly variable selec- tion. The importance of variable selection, especially in high-dimensional data space, cannot be overly emphasized. Hence, it is important that one is equipped with ways to overcome separation and best if one can have a method that not merely address the issue of separation but also possess the desired properties of variable selection.