Analysis of the execution of the main investment projects

This thesis applies the described time-to-event analysis methods to develop prognostic models in the competing risks setting. Now that the frameworks for standard and competing risks time-to-event analysis have been described, it is important to give a general overview of other key issues that arise when developing prognostic models. Statistical considerations generally advocated during the development of prognostic models include: evaluating data quality; manipulation and selection of candidate prognostic factors; testing model assumptions; and managing overfitting and optimism (internal validation) (Moons et al., 2012b, Royston et al., 2009, Steyerberg, 2008, Steyerberg et al., 2013). An overview of each of these is now given.

1.5.1 Evaluating data quality

The quality of the data is evaluated to ensure that the data are fit for purpose, with minimal measurement error (Royston et al., 2009), which enhances the reliability of the prognostic model (Steyerberg et al., 2013). Data should ideally come from a prospective cohort study with standardised measurement of the outcome and candidate prognostic factors (Moons et al., 2012b).

Missing values are common in prognostic model research (Moons et al., 2012b), and numerous statistical techniques exist to manage them. Complete case analysis excludes any participant with missing information, resulting in a loss of statistical power and often invalid parameter estimates (Royston et al., 2009, Moons et al., 2012b); it is therefore only advised when few observations (say <5%) are missing (Harrell, 2015). Imputation methods use multivariable models containing both dependent and independent variables to impute plausible values in place of missing information, and are typically more efficient than complete case analysis (White and Royston, 2009). Multiple imputation methods repeat the imputation a number of times to produce multiple imputed datasets, so that the uncertainty in the imputed values is carried into the results. Each imputed dataset is analysed using standard methods, and the analysis results are combined using Rubin's rules (Rubin, 2004); those applicable to prognostic model research (Marshall et al., 2009) are outlined in Table 1.3. Multiple imputation using chained equations (Buuren and Oudshoorn, 2000) uses iterative applications of regression imputation to recover missing values under a missing at random assumption.
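As a rough illustration of the guideline that complete case analysis is only advisable when few observations (say <5%) are missing, the proportion of participants with any missing value can be checked before choosing a strategy. The sketch below assumes, purely for illustration, that each participant is a dictionary with `None` marking a missing value; the function names and the record layout are hypothetical.

```python
def incomplete_fraction(records):
    """Return the proportion of records with at least one missing (None) value."""
    if not records:
        raise ValueError("no records supplied")
    incomplete = sum(
        1 for row in records if any(value is None for value in row.values())
    )
    return incomplete / len(records)


def complete_case_reasonable(records, threshold=0.05):
    """True when fewer than `threshold` of records would be dropped.

    The 5% default mirrors the rule of thumb quoted in the text; it is a
    guideline, not a formal test.
    """
    return incomplete_fraction(records) < threshold
```

When the check fails, an imputation approach would usually be considered instead of discarding the incomplete participants.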

Table 1.3: Rubin's rules for combining estimates from multiply imputed data

Parameter of interest | Equation for combining estimates
Regression coefficient $\beta$ | $\beta = \frac{1}{M} \sum_{i=1}^{M} \hat{\beta}_i$
Estimated within imputation variance $U$ | $U = \frac{1}{M} \sum_{i=1}^{M} U_i$
Between imputation variance $B$ | $B = \frac{1}{M-1} \sum_{i=1}^{M} (\hat{\beta}_i - \beta)^2$
Total variance $T$ | $T = U + \left(1 + \frac{1}{M}\right) B$
Wald test statistic $W$ | $W = \frac{\beta^2}{T}$
Degrees of freedom $v$ | $v = (M - 1)\left(1 + \frac{U}{\left(1 + \frac{1}{M}\right) B}\right)^2$
P-value $p$ | $p = F_{1,v}(W)$ (F-distribution test)

The above rules, known as Rubin's rules (Rubin, 2004), are given for combining estimates in $M > 1$ imputed datasets, where $\hat{\beta}_i$ represents the estimated regression coefficient in the $i$th imputed dataset and $U_i$ represents its variance.

The application of these rules specifically for prognostic model research is discussed further in (Marshall et al., 2009).
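A minimal sketch of the pooling rules in Table 1.3 for a single regression coefficient is given here, using only the Python standard library. The function name `pool_rubin` and the returned dictionary keys are illustrative choices rather than part of any package. The p-value is left out because it needs an F-distribution function, which the standard library does not provide.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one regression coefficient across M > 1 imputed datasets.

    estimates: the M coefficient estimates (beta-hat_i)
    variances: the M estimated variances of those coefficients (U_i)
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need M > 1 estimates with matching variances")

    beta = mean(estimates)        # pooled coefficient
    u = mean(variances)           # within imputation variance U
    b = variance(estimates)       # between imputation variance B (divisor M - 1)
    t = u + (1 + 1 / m) * b       # total variance T
    w = beta ** 2 / t             # Wald test statistic W
    # Degrees of freedom v; the formula divides by B, so B = 0 is treated
    # here as infinite degrees of freedom (an assumption of this sketch).
    v = (m - 1) * (1 + u / ((1 + 1 / m) * b)) ** 2 if b > 0 else float("inf")
    return {"beta": beta, "U": u, "B": b, "T": t, "W": w, "v": v}
```

If SciPy is available, the p-value could then be obtained as the upper tail of the F distribution with 1 and v degrees of freedom, for example with `scipy.stats.f.sf(w, 1, v)`.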

It is advised that, when imputing missing information with a time-to-event outcome, outcome information (a binary variable indicating whether the event occurred and an estimate of the cumulative hazard function) is incorporated into the chained equations (White and Royston, 2009). When competing events are present, the chained equations should include a categorical event indicator and Nelson-Aalen estimators for each of the K cumulative cause-specific hazard functions over time (Bartlett and Taylor, 2016). Logically, when using the cause-specific approach, information on all K events is incorporated into the chained equations, as all K events are modelled to estimate the cause-specific cumulative incidence function. As the subdistribution approach only requires the event of interest to be modelled, the chained equations need only include a binary event indicator for the event of interest and a Nelson-Aalen estimate of the cumulative subdistribution hazard function for the event of interest.

1.5.2 Manipulation and selection of candidate prognostic factors

The term “candidate prognostic factor” refers to any prognostic factor which is considered during the development of the prognostic model (Moons et al., 2012b). Existing prognostic factors with known predictive ability, evidenced by prognostic factor research, are usually considered as candidates in prognostic model development (Royston et al., 2009). Manipulation of candidate prognostic factors may be required to aid statistical modelling. For instance, new prognostic factors can be created by combining existing ones, such as BMI created from height and weight. Categories within ordered categorical prognostic factors that contain small numbers of participants may be collapsed to give more stable results (Steyerberg, 2008). Continuous prognostic factors may require manipulation if they are not thought to have a linear relationship with the outcome. Transformation of the prognostic factor (for example using fractional polynomial functions) is preferred over categorisation, as more predictive information is retained (Moons et al., 2012b, Royston et al., 2009). For prognostic models with time-to-event outcomes, centering of continuous prognostic factors helps to ensure a meaningful baseline hazard function (Royston and Altman, 2013), aiding interpretation of relative risks. The reference category in categorical factors should also be selected with care (Steyerberg, 2008) to ensure a representative baseline against which relative risk comparisons are made (the importance of which is highlighted in Box 1.4).
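Two of the manipulations mentioned above, deriving BMI from height and weight and centering a continuous prognostic factor, can be sketched as follows. The function names and the choice of the sample mean as the centering value are assumptions made for illustration; in practice the centering value might instead be a clinically meaningful reference point.

```python
from statistics import mean


def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def centre(values):
    """Subtract the sample mean so that zero corresponds to an average participant."""
    centre_value = mean(values)
    return [value - centre_value for value in values]
```

After centering, the baseline hazard refers to a participant with an average value of the factor rather than a value of zero, which is often outside the observed range.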

Often more candidate prognostic factors are available than can sensibly be used in a prognostic model (Royston et al., 2009), and selection methods are required to reduce their number. Parsimonious models, which reach a suitable level of prediction but contain fewer prognostic factors, are preferred. Strongly correlated prognostic factors can be removed from models, as they contribute little independent information and explain the same variation (Harrell, 2015, Royston et al., 2009). Numerous selection methods are used in practice (Heinze et al., 2018); however, there is no consensus on which is the “best” approach (Royston et al., 2009). Fitting the full model (i.e. including all candidate prognostic factors) avoids selection bias, reduces the potential for overfitting, and leads to reliable confidence intervals, though perhaps at the expense of an unnecessarily complex model (Royston et al., 2009). Automated selection algorithms are commonly applied to reduce the number of candidate prognostic factors included in the final multivariable model; of those available, backward elimination is recommended (Collins et al., 2015, Moons et al., 2015). In backward elimination, a full model is fitted initially. The least statistically significant prognostic factor is then eliminated and the model re-fitted, and this process is repeated until all remaining prognostic factors are significant at a pre-specified significance level. Alternative approaches which combine variable selection with adjustment for overfitting, such as the LASSO, LARS and elastic net methods, may be useful for variable selection, but are not considered in this thesis.
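The backward elimination loop described in this paragraph can be sketched independently of any particular regression model. In the sketch below, `fit_pvalues` is a hypothetical callback, supplied by the user, that fits the chosen model on a list of factors and returns a p-value for each. The `forced` argument is likewise an illustrative addition that keeps named factors in the model regardless of significance.

```python
def backward_eliminate(candidates, fit_pvalues, alpha=0.05, forced=()):
    """Backward elimination at a pre-specified significance level alpha.

    candidates:  names of all candidate prognostic factors (the full model)
    fit_pvalues: hypothetical callback; takes a list of factor names, fits
                 the model, and returns {factor name: p-value}
    forced:      factors never eliminated, whatever their p-value
    """
    selected = list(candidates)
    while selected:
        pvalues = fit_pvalues(selected)   # re-fit on the current set
        removable = [f for f in selected if f not in forced]
        if not removable:
            break
        # Least statistically significant factor = largest p-value.
        worst = max(removable, key=lambda f: pvalues[f])
        if pvalues[worst] < alpha:
            break                         # all removable factors are significant
        selected.remove(worst)
    return selected
```

Because the model is re-fitted after each removal, a factor that was non-significant in the full model can become significant later and be retained.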

When competing events are present, the selection of prognostic factors may also depend on the analysis approach. Firstly, all K events are modelled for the cause-specific approach, and the models for the different events may contain different combinations of prognostic factors. Secondly, the differences in the subdistribution and cause-specific hazards could result in different sets of prognostic factors being selected for the final model, depending on the approach taken.

To ensure clinical credibility, prognostic models should include clinically important prognostic factors (i.e. those considered to be prognostic by clinicians) regardless of statistical significance (Wyatt and Altman, 1995). Additionally, prognostic factors representing treatments or therapies should ideally be included in the final prognostic models, to circumvent poor prognostic performance if applied to future patients not receiving those treatments (Groenwold et al., 2016).

1.5.3 Testing model assumptions

General assumptions made in multivariable regression modelling include linearity and additivity. Linearity refers to the assumption that the effect of a single unit increase in a continuous prognostic factor on the log cumulative hazard function is constant. If this assumption is not appropriate, non-linear relationships may be incorporated by transforming the prognostic factors using fractional polynomial functions or splines (Harrell, 2015). Additivity refers to the assumption that the effect of one prognostic factor does not depend on the value of another. If this assumption is not appropriate, interaction terms between prognostic factors can be incorporated into the model (Steyerberg, 2008). Additionally, prognostic models with time-to-event outcomes assume both proportional hazards and non-informative censoring.
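To make the linearity and additivity assumptions concrete, consider a proportional hazards model with two prognostic factors. The notation here is introduced only for illustration: $H(t \mid x)$ is the cumulative hazard, $H_0(t)$ the baseline cumulative hazard, and $x_1$, $x_2$ the prognostic factors. Under both assumptions the model takes the first form below; relaxing additivity adds the product term shown in the second.

```latex
% Linear and additive: each unit increase in x_1 shifts the log cumulative
% hazard by beta_1, whatever the value of x_2.
\log H(t \mid x) = \log H_0(t) + \beta_1 x_1 + \beta_2 x_2

% Additivity relaxed with an interaction term: the effect of a unit
% increase in x_1 is now beta_1 + beta_3 x_2.
\log H(t \mid x) = \log H_0(t) + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2
```

Relaxing linearity would instead replace $x_1$ with a transformed version, such as a fractional polynomial or spline term.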

1.5.4 Managing overfitting and optimism

Statistical overfitting is present when a model is too complex for the amount of information in the data (Harrell, 2015), such that the model is too closely adapted to the data in which it was developed and regression coefficients are overestimated (Royston et al., 2009). Optimism in prognostic models is less likely if they are developed using few candidate prognostic factors relative to the number of events (Riley et al., 2019b, Royston et al., 2009, Steyerberg et al., 2013). In particular, it is often recommended that there should be at least ten events per candidate prognostic factor (Peduzzi et al., 1995, Peduzzi et al., 1996). Optimism is identified through internal validation techniques, discussed below, and can be adjusted for using shrinkage methods to give a more robust model for the intended population (Steyerberg, 2008).
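The ten events per candidate prognostic factor guideline amounts to simple arithmetic, sketched below. The function name is illustrative, and the default of ten is the rule of thumb quoted above rather than a formal sample size calculation.

```python
def max_candidate_factors(n_events, events_per_factor=10):
    """Largest number of candidate prognostic factors the guideline supports."""
    if n_events < 0 or events_per_factor <= 0:
        raise ValueError("n_events must be >= 0 and events_per_factor > 0")
    return n_events // events_per_factor


# Example: a dataset with 87 events of interest supports at most 8
# candidate prognostic factors under the ten-events guideline.
print(max_candidate_factors(87))
```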
