Having explained the basic concept and estimation methods of HBDMs, it is necessary to consider some of the problems that may arise when applying such models. The first problem, which stems from the nature of duration data, is censoring. Censoring occurs when the start or end of an individual’s duration falls outside the observation period. To illustrate this problem, consider duration data collected from seven drivers between February and October to study the time from the day a driving licence is obtained until the first accident. As demonstrated in Figure 3-3 below, there are several types of censoring. The first type is left censoring (driver D). It occurs when an individual’s duration started (the day the driving licence was obtained) before the observation period began. This type of censoring causes many difficulties when applying hazard-based models because it makes the likelihood function more complex (Washington et al., 2003).
The second form of censoring is right censoring (drivers B, C). It occurs when an individual’s duration ends after the observation period; in this example, when the driver has an accident after the end of the study period. Compared with left censoring, right censoring is easier to handle: a small adjustment is made to the likelihood function, and estimation then proceeds by standard maximum likelihood methods. A duration may also be both left and right censored if it began before and ended after the observation period (Washington et al., 2003).
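The small adjustment to the likelihood function for right censoring can be sketched as follows. This is a minimal illustration, not the estimation procedure used in this research: it assumes an exponential duration model, and the function names and the data are hypothetical. A completed duration contributes its density to the likelihood, whereas a right-censored duration contributes only its survival probability.

```python
import math

def exponential_loglik(rate, durations, observed):
    """Log-likelihood of an exponential duration model with right censoring.

    durations: time under observation for each individual
    observed:  True if the event occurred, False if right censored
    """
    loglik = 0.0
    for t, event in zip(durations, observed):
        if event:
            # Completed duration: density f(t) = rate * exp(-rate * t)
            loglik += math.log(rate) - rate * t
        else:
            # Right-censored duration: survival S(t) = exp(-rate * t) only
            loglik += -rate * t
    return loglik

def exponential_mle(durations, observed):
    """Closed-form maximiser: number of events divided by total time at risk."""
    return sum(observed) / sum(durations)

# Hypothetical data: months from licence to first accident or to end of study
durations = [3.0, 8.0, 8.0, 5.0, 2.0]
observed = [True, False, False, True, True]

rate_hat = exponential_mle(durations, observed)
# Treating censored durations as if they were events overstates the hazard
naive_rate = len(durations) / sum(durations)
```

In this sketch the censored drivers still add to the time at risk in the denominator, which is why ignoring the censoring indicator would overstate the accident hazard.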
Furthermore, censoring may occur for reasons other than incomplete observation. These include difficulty in following up some individuals, for example because of withdrawal (driver E) or loss (driver F) (Washington et al., 2003). Also, in some cases the event of interest that marks the end of the duration (the occurrence of an accident in this example) is not clearly defined.
One possible way of handling censoring is to ensure that all individuals’ durations lie within the observation period. This can be achieved by meeting three requirements. First, both the start and end points of the study period must be unambiguously identified. Second, an appropriate time scale must be selected. Finally, the event of interest must be clearly defined (Alison, 1984; Cox and Oakes, 1984).
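Figure 3-3 Illustrations of duration data [figure: drivers A to G plotted over months 1 to 11; legend: not censored, right censored, left censored, withdrawn, lost]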
Another problem that may arise when modelling duration data is time-varying covariates. In the same example of the time from obtaining a driving licence until the first accident, one or more covariates, such as vehicle type, could change over the study period. If such a change is not accounted for in the model, the estimated parameters could be biased. Moreover, although there are ways of incorporating time-varying covariates in HBDMs, the interpretation of duration effects remains difficult (Washington et al., 2003).
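One common way of preparing data with a time-varying covariate is to split each duration into sub-episodes at the times the covariate changes, so that the covariate is constant within each row. The sketch below illustrates this idea only; the source does not state which approach is intended, and the function name, the tuple layout and the example values are hypothetical.

```python
def split_episodes(start, stop, event, changes, initial_value):
    """Split one duration into (start, stop, event, value) rows.

    changes: list of (time, new_value) pairs sorted by time, giving the
             moments at which the covariate takes a new value
    """
    rows = []
    current_start, current_value = start, initial_value
    for change_time, new_value in changes:
        if start < change_time < stop:
            # Close the sub-episode at the change time without an event
            rows.append((current_start, change_time, False, current_value))
            current_start = change_time
        if change_time < stop:
            # Changes at or after the end of the duration are ignored
            current_value = new_value
    # Only the final sub-episode carries the original event indicator
    rows.append((current_start, stop, event, current_value))
    return rows

# Hypothetical driver: licensed at month 0, accident at month 9,
# vehicle type changed from "car" to "van" at month 4
rows = split_episodes(0, 9, True, [(4, "van")], "car")
```

Each resulting row can then enter the likelihood as a separate observation that starts at its own entry time, with the event recorded only on the last row.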
Another area of concern is unobserved heterogeneity (frailty). HBDMs implicitly assume that the survival distribution is homogeneous across all observations. In other words, it is assumed that the covariate vector captures all variation in the durations. This homogeneity does not hold if there are unobserved factors that affect the duration but are not included in the covariate vector, which gives rise to what is known as unobserved heterogeneity (Mannering et al., 1990). Relevant covariates may be omitted because they are difficult to measure or are unobservable altogether. On some occasions the analyst may not be aware that a particular covariate should be included in the model. Failure to control for unobserved heterogeneity may therefore cause severe problems, such as inconsistent estimates of coefficients and standard errors, incorrect inference about the shape of the hazard function and incorrect estimates of covariate effects (Heckman and Singer, 1984; Box-Steffensmeier and Jones, 2004).
A common way of addressing this problem is to introduce a heterogeneity term that captures unobserved effects in the model. The heterogeneity term acts as an error term in the model specification, and models that include it are referred to as frailty models. It should also be noted that in traditional regression modelling the error term shows how the expectation of the duration depends on the covariates, whereas in duration models it shows how the distribution of the duration depends on the covariates (Blossfeld et al., 2007). The focus of traditional regression modelling is therefore different from that of duration modelling.
In the PH model, heterogeneity is introduced as follows:
$$h(t \mid x_i) = h_0(t)\, e^{x_i \beta + w} \tag{29}$$

where $w$ denotes the unobserved heterogeneity term, $\beta$ denotes an unknown parameter and $x_i$ denotes the independent variable. The heterogeneity term is assumed to follow a particular distribution over the population, such as the gamma or inverse Gaussian distribution. Of these, the gamma distribution is widely adopted. However, the choice of distribution is rarely justified. It should also be stressed that the choice of distribution affects the estimation of the model and the identification of key parameters (Heckman and Singer, 1984).
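The effect of gamma-distributed heterogeneity on the apparent shape of the hazard can be illustrated with a short calculation. The sketch rests on assumptions that are not taken from the source: the multiplicative frailty $e^{w}$ is gamma distributed with mean 1 and variance `theta`, the baseline hazard is constant, and the parameter values are hypothetical. Under these assumptions the survival function averaged over the population is $(1 + \theta H_0(t))^{-1/\theta}$, where $H_0(t)$ is the cumulative baseline hazard, and the population hazard is $h_0(t) / (1 + \theta H_0(t))$.

```python
import math

def population_survival(t, baseline_rate, theta):
    """Survival averaged over gamma frailty with mean 1 and variance theta."""
    cumulative_hazard = baseline_rate * t  # H0(t) for a constant baseline hazard
    return (1.0 + theta * cumulative_hazard) ** (-1.0 / theta)

def population_hazard(t, baseline_rate, theta):
    """Hazard seen at the population level: h0(t) / (1 + theta * H0(t))."""
    return baseline_rate / (1.0 + theta * baseline_rate * t)

baseline_rate = 0.2  # hypothetical constant individual-level hazard per month
theta = 1.5          # hypothetical frailty variance

# Population hazard at months 0, 3, 6 and 9; it declines over time even
# though every individual's hazard is constant
hazards = [population_hazard(m, baseline_rate, theta) for m in (0, 3, 6, 9)]
```

Although each individual has a constant hazard in this sketch, the population hazard falls over time because the individuals with higher frailty tend to experience the event first. This is one way in which ignoring heterogeneity can lead to an incorrect inference about the shape of the hazard function.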
In the general form of AFT models, unobserved heterogeneity cannot be introduced because of an incorporation problem. The log-linear equation of the accelerated lifetime model, equation (28), already contains an error term, so it is not possible to add another error term to it. The heterogeneity term is therefore not incorporated in general AFT models. However, for specific distributions such as the exponential or Weibull, the heterogeneity term can be incorporated because both distributions can be written in the PH and AFT metrics (Bhat, 2000). In addition, the prediction of the mean duration after fitting a frailty model would not be possible (Golder, 2012). A review of previous research found no evidence of predicted durations being calculated once frailty (unobserved heterogeneity) had been added to the models. Also, when Stata software is used to predict the mean duration after fitting a frailty model, the following error appears: “unconditional mean predictions for frailty models currently unavailable” (StataCorp, 2007). Thus, for the purpose of achieving Objective 8 of this research, unobserved heterogeneity was not considered in the models.
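The statement that the Weibull distribution can be written in both the PH and AFT metrics can be checked numerically. The sketch below assumes a Weibull baseline survival function $\exp(-\lambda t^{p})$ and a single covariate; the function names and parameter values are hypothetical. Under this parameterisation the AFT coefficient equals the PH coefficient multiplied by $-1/p$, and the two metrics give the same survival probabilities.

```python
import math

def weibull_ph_survival(t, x, lam, p, beta):
    """PH metric: hazard lam * p * t**(p - 1) * exp(x * beta)."""
    return math.exp(-lam * t ** p * math.exp(x * beta))

def weibull_aft_survival(t, x, lam, p, aft_coef):
    """AFT metric: time is rescaled by exp(-x * aft_coef) in the baseline."""
    rescaled_t = t * math.exp(-x * aft_coef)
    return math.exp(-lam * rescaled_t ** p)

lam, p, beta = 0.05, 1.4, 0.6  # hypothetical parameter values
aft_coef = -beta / p           # conversion between the two metrics

# Largest difference between the two metrics over a grid of times and
# covariate values; it should be zero up to rounding error
max_gap = max(
    abs(weibull_ph_survival(t, x, lam, p, beta)
        - weibull_aft_survival(t, x, lam, p, aft_coef))
    for t in (0.5, 2.0, 6.0, 10.0)
    for x in (0.0, 1.0, 2.5)
)
```

The opposite signs of the two coefficients reflect the difference in interpretation: a covariate that raises the hazard in the PH metric shortens the expected duration in the AFT metric.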
Finally, to minimize the appearance of unobserved heterogeneity, data collection and data analysis (specifically variable selection) should be carefully performed (Mannering et al., 1990; Hensher and Mannering, 1994; Washington et al., 2003).