When developing a risk prediction model for survival data it is essential that the performance of the model is evaluated using appropriate validation measures. Although a number of measures have been proposed, there is only limited guidance regarding their use in practice. The aim of this research was to perform a simulation study based on two clinical datasets with contrasting characteristics to investigate a wide range of validation measures in order to make practical recommendations regarding their use.
Based on the simulation study, the measures of predictive accuracy (IBS) and explained variation (V and R²_IBS) cannot be recommended for use with survival risk models due to their poor performance in the presence of censored data. However, these measures were all conservative with censored data so that high (or low for IBS) values would still be indicative of a good risk model. Of the discrimination measures, K(β) was not biased in the presence of censoring. The performance of D in the presence of censoring depended on the distribution of the prognostic index. Provided that the prognostic index was approximately normally distributed, the effect of censoring on the bias in D was negligible. The C-index was affected by censoring and cannot be recommended for use with data with more than 30% censoring. The sole calibration measure under investigation, CS, was unbiased in the presence of censoring.
All the measures of discrimination, predictive accuracy and explained variation showed sensitivity to the omission of important predictors from a model. However, the rank-based measure C-index was less sensitive than the other measures. The calibration slope showed only limited sensitivity to predictor omission since the developed risk model effectively re-calibrates itself to compensate for the omitted predictors.
The validation measures differ in their flexibility regarding their assumptions and the form of the risk model. The concordance measure C-index requires only that the risk model is able to rank the patients. In contrast, K(β) requires that the risk model was fitted using the Cox proportional hazards model. The D statistic assumes that proportional hazards holds and that the prognostic index is normally distributed. The calibration slope measure, as described, also assumes proportional hazards, although more general approaches are described by van Houwelingen [44]. The measures based on predictive accuracy, IBS, R²_IBS, and V, only require that a survival function can be calculated for all patients.
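To illustrate why the C-index needs nothing more than a ranking of patients, the following is a minimal sketch of a Harrell-type concordance calculation for right-censored data. It is not taken from the study itself; the function name `harrell_c` and its argument names are hypothetical, and the pair-selection rule (a pair is usable only when the shorter time is an observed event) is the standard textbook convention, assumed here rather than confirmed by the text.

```python
def harrell_c(times, events, risk_scores):
    """Harrell-type C-index for right-censored survival data (illustrative sketch).

    times       : observed follow-up times
    events      : 1 if the event was observed, 0 if censored
    risk_scores : any score where a higher value means higher predicted risk
                  (for example a prognostic index); only its ordering is used
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier failure had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs: C-index is undefined")
    return concordant / usable


# Hypothetical toy data: four patients, one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
scores = [1.0, 3.0, 2.0, 0.5]
c_value = harrell_c(times, events, scores)
```

Because only comparisons between scores enter the calculation, any strictly increasing transformation of the risk score leaves the result unchanged. The sketch also shows how censoring removes pairs from the calculation, which is one way to see why heavy censoring can affect this measure.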
With respect to clinical interpretation, all of the measures considered in this paper can be easily communicated to a non-statistical health researcher, except perhaps for the calibration slope and IBS. The concordance measures can be readily communicated in terms of correctly ranking patient pairs, and explained variation measures are intuitive with their percentage scale. The D statistic also has a straightforward interpretation, as it can be communicated as a (log) relative risk between low and high risk groups of patients.
In summary, based on the findings of this simulation study, K(β) can be recommended for validating a risk model developed using the Cox proportional hazards model, since it is both robust to censoring and reasonably sensitive to the omission of important predictors. D can also be recommended provided that the distribution of the prognostic index is approximately normal. It is more sensitive to predictor omission than K(β) and can be calculated for models other than those fitted using the Cox model. The calibration slope can be recommended as a measure of calibration since it is not affected by censoring, although it is less sensitive than the other measures to the omission of important predictors. In practice, one might additionally investigate calibration graphically by comparing observed and predicted survival curves for groups of patients. This approach also has the benefit of being easy to communicate.
An important point to note is that the characteristics of the validation data should be investigated before choosing the validation measures. In particular, the level of censoring and the distribution of the prognostic index need to be checked, assuming that the standard model assumptions such as proportional hazards hold. It is not clear that this is routinely done in practice.
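The two checks named above can be sketched in a few lines. This is an illustrative example only: the function names are hypothetical, the 30% figure is the censoring level at which the C-index is said above to become unreliable, and moment-based skewness and excess kurtosis are used here as one simple (assumed) way to gauge departure from normality. A formal test or a normal quantile plot could equally be used.

```python
def censoring_proportion(events):
    """Proportion of censored observations (events: 1 = event, 0 = censored)."""
    return 1.0 - sum(events) / len(events)


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis; both are 0 for a normal distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def describe_validation_data(events, prognostic_index):
    """Summarise the two characteristics to check before choosing measures."""
    prop = censoring_proportion(events)
    skew, kurt = skewness_and_excess_kurtosis(prognostic_index)
    return {
        "censoring_proportion": prop,
        # Threshold taken from the recommendation for the C-index above.
        "censoring_above_30_percent": prop > 0.30,
        "pi_skewness": skew,
        "pi_excess_kurtosis": kurt,
    }


# Hypothetical validation data: event indicators and prognostic index values.
summary = describe_validation_data(
    events=[1, 0, 0, 1, 1],
    prognostic_index=[-2.0, -1.0, 0.0, 1.0, 2.0],
)
```

A censoring proportion above 30% would argue against relying on the C-index, while marked skewness or heavy tails in the prognostic index would argue for caution with D.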
3.6 Conclusion
This chapter investigated some of the validation measures that have been used for independent survival outcomes. By means of a simulation study based on two real datasets, this investigation compared their performance against criteria for a suitable validation measure for a survival model. The results in the simulation study provided guidelines for using these measures in practice, particularly when data have censoring.
The next chapter discusses the possible extensions of validation measures that have been used for independent binary outcomes for use with correlated/clustered binary outcomes.
Measures for clustered binary outcomes
4.1 Introduction
Clustered binary outcomes occur frequently in health care research. For example, subjects could be nested in larger units such as hospitals, doctors, families, or geographic regions. Because of this clustering, outcomes in the same cluster often share common cluster-level characteristics and thus tend to be correlated. Various statistical models have been proposed in the last two decades to model the relationship between predictors and outcomes in the presence of clustering, particularly focusing on how to account for the effect of clustering. These models are typically grouped into two broad classes: cluster-specific and population-averaged approaches [79, 80].
In the cluster-specific approach, the probability distribution of the outcomes is modelled as a function of fixed predictors and one or more random terms. The random term represents the effect of unobserved cluster-specific characteristics, which varies across clusters following a specific distribution. This modelling approach is known as the random effects model; an example is the random effects logistic model for clustered binary outcomes [81, 82]. In the population-averaged approach, the marginal or population-averaged expectation is modelled as a function of predictors, treating the correlation between the outcomes within the same cluster as a nuisance parameter. Marginal logistic models, with generalized estimating equations [83] for the estimation of the model parameters, are often used for modelling clustered binary outcomes. The estimates from the random effects models have a conditional interpretation, given the cluster-specific random effect, while the estimates from the marginal models have a population-averaged interpretation. The conditional estimates from a logistic model can be interpreted as the effect of a unit change in the predictors for subjects belonging to the same cluster, whereas the marginal estimates can be interpreted as the averaged effect of a unit change in the predictors for all subjects in the population. Generally, the preference for one of these two classes of models depends on the type of inference a researcher would like to draw in practice: conditional or marginal [84]. Lee and Nelder [85] and Skrondal and Rabe-Hesketh [86] considered the random effects models to be a more general form of model for analysing clustered binary data, from which the marginal models can be derived by integrating out the random effects. It is thus possible to obtain both conditional and marginal predictions from the random effects models.
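As a concrete illustration of obtaining both kinds of prediction from one model, the sketch below assumes the simplest case: a random-intercept logistic model in which logit P(y = 1 | u) = η + u, where η is the fixed-effects linear predictor and u ~ N(0, σ²) is the cluster effect. The conditional prediction plugs in a value of u, while the marginal prediction averages over the distribution of u. The function names, the normality assumption for u, and the midpoint-rule integration are choices made for this example, not details taken from the text.

```python
import math


def expit(x):
    """Inverse logit, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def conditional_risk(linear_predictor, random_intercept):
    """Predicted risk given a specific cluster effect u."""
    return expit(linear_predictor + random_intercept)


def marginal_risk(linear_predictor, sigma_u, n_points=2001, width=8.0):
    """Predicted risk averaged over u ~ N(0, sigma_u^2).

    Uses a midpoint rule on [-width * sigma_u, width * sigma_u]; the grid
    size and width are arbitrary illustrative settings.
    """
    if sigma_u == 0:
        return expit(linear_predictor)  # no clustering: both predictions coincide
    lower = -width * sigma_u
    step = 2.0 * width * sigma_u / n_points
    norm_const = sigma_u * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n_points):
        u = lower + (k + 0.5) * step
        density = math.exp(-0.5 * (u / sigma_u) ** 2) / norm_const
        total += expit(linear_predictor + u) * density * step
    return total


# Hypothetical values: linear predictor 2.0, random-intercept SD 1.5.
risk_in_average_cluster = conditional_risk(2.0, 0.0)
risk_population_averaged = marginal_risk(2.0, 1.5)
```

With a non-zero random-intercept variance the marginal risk is pulled towards 0.5 relative to the conditional risk at u = 0, which reflects the difference between cluster-specific and population-averaged interpretations described above.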
Although the clustering of data within larger units is usually taken into account in explanatory models in aetiological research, it is often ignored in risk prediction research, both in the process of model development and the validation of the model’s performance [87]. This work focuses on the use of random effects logistic models in risk prediction for clustered binary outcomes. To understand the predictive ability of such a model, it is essential to validate its predictive performance. Validation measures for assessing the predictive ability of models for independent binary outcomes are reasonably well developed; see, for example, Omar et al. [10], Steyerberg et al. [24], Royston and Altman [25], and Harrell et al. [40]. However, very limited research has been conducted to develop validation measures for models with clustered binary outcomes. This chapter discusses possible extensions of some of the existing validation measures that could be used to assess the predictive ability of prognostic models based on the random effects logistic models.
The C-index [45] and the D-statistic [49] are commonly used validation measures to assess the discriminatory ability of prognostic models for independent binary outcomes. The calibration slope [39, 42] is commonly used to assess whether the model predicts accurately for a group of subjects (calibration), and the Brier score [55] is often used to assess accuracy for individual predictions (predictive accuracy). In this chapter, these validation measures are extended for use with models for clustered binary outcomes. The Hosmer-Lemeshow Chi-squared test statistic [41] is also used frequently to assess a model’s calibration. This test assesses whether or not the observed event rates match the expected event rates in subgroups of the model population, where the groups are identified from the deciles of the predicted risk of having the event. However, it is not straightforward to evaluate this measure using a simulation study. Therefore, this measure is not investigated for the models with clustered binary outcomes.
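For orientation, two of these measures can be sketched for the independent-outcome case as follows. This is an illustrative example under standard definitions (the C-index as the proportion of event/non-event pairs in which the event has the higher predicted risk, with ties counted as one half, and the Brier score as the mean squared difference between outcome and predicted risk). The function names are hypothetical, and the extensions to clustered data discussed in this chapter are not shown.

```python
def c_index_binary(outcomes, predicted_risks):
    """C-index for binary outcomes: concordance over all event/non-event pairs."""
    event_risks = [p for y, p in zip(outcomes, predicted_risks) if y == 1]
    nonevent_risks = [p for y, p in zip(outcomes, predicted_risks) if y == 0]
    if not event_risks or not nonevent_risks:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p_event in event_risks:
        for p_nonevent in nonevent_risks:
            if p_event > p_nonevent:
                score += 1.0   # event subject ranked above non-event subject
            elif p_event == p_nonevent:
                score += 0.5   # ties count as half
    return score / (len(event_risks) * len(nonevent_risks))


def brier_score(outcomes, predicted_risks):
    """Mean squared difference between observed outcome (0/1) and predicted risk."""
    n = len(outcomes)
    return sum((y - p) ** 2 for y, p in zip(outcomes, predicted_risks)) / n


# Hypothetical toy data: observed outcomes and model-predicted risks.
outcomes = [1, 0, 1, 0]
risks = [0.8, 0.3, 0.4, 0.6]
discrimination = c_index_binary(outcomes, risks)
accuracy = brier_score(outcomes, risks)
```

The C-index uses only the ordering of the predicted risks, whereas the Brier score depends on their actual values, so the two can rank competing models differently.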
The chapter begins with a brief description of the proposed validation measures for independent binary outcomes, then discusses the estimation of these measures for clustered data. The methods are illustrated using data on patients who had undergone heart valve surgery. A simulation study is conducted to evaluate the performance of the validation measures under various clustered data scenarios.