
4.1. MODEL CONSTRUCTION

The concept of the “best model” has garnered considerable attention in statistical modeling. The trade-off between model complexity and goodness of fit is typically a significant concern. Furthermore, with 438 different variables, a significant challenge exists in finding the best subset of variables to include in a predictive model. An exhaustive search over the space of all possible models is combinatorially infeasible, so a number of commonly used (but not necessarily optimal) strategies are employed to simplify the search problem. I first filtered the candidate variables, then examined univariate models, and finally performed backward elimination to arrive at a final model. Each of these steps is further explained below.

¹ The latest release of MIMIC II contains slightly better follow-up data by supplying a unique subject identifier to track the same patient across multiple hospital visits. For patients who leave the hospital alive on their last recorded visit, however, no follow-up is known. Work is underway to establish follow-up status by using the social security death records.
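To make the size of the search space concrete, the short sketch below counts the candidate models under the simplifying assumption that each of the 438 covariates is either included or excluded (alternative functional forms and interactions are ignored). The count has 132 decimal digits.

```python
# Count the candidate subsets of 438 covariates, assuming each covariate
# is simply "in" or "out" of the model (an assumption for illustration).
n_covariates = 438
n_subsets = 2 ** n_covariates

# Python integers have arbitrary precision, so the count is exact.
n_digits = len(str(n_subsets))
print(f"2^{n_covariates} has {n_digits} decimal digits")
```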

Variable Filtering

Given the inclusive nature of my dataset preparation, the dataset included a number of variables that had limited availability or were irrelevant. I eliminated such covariates by applying three simple filters. The first two filters addressed the problem of missing data: (1) I excluded covariates that were available for fewer than 80% of the development patients; (2) I removed covariates that had an average per-patient availability of less than 60% of the patient instances (e.g., tests that were only executed on the first day for a typical patient and unavailable for subsequent days). The third filter removed irrelevant variables: (3) variables that remained effectively constant across the development patients were dropped.
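The three filters can be sketched as follows. This is a minimal illustration, not the code used for this research: the data layout (a dictionary of per-patient value lists with None for missing values), the function name filter_covariates, and the variance tolerance used to decide "effectively constant" are all assumptions. It also assumes every patient has at least one instance.

```python
from statistics import mean, pvariance

def filter_covariates(data, min_patient_frac=0.80, min_instance_frac=0.60,
                      min_variance=1e-12):
    """Return the covariate names that pass all three filters.

    data maps covariate name -> {patient_id: [value or None per instance]}.
    """
    kept = []
    for name, by_patient in data.items():
        # Fraction of instances with a recorded value, per patient.
        avail = [sum(v is not None for v in vals) / len(vals)
                 for vals in by_patient.values()]
        # Filter 1: covariate observed at least once for enough patients.
        if sum(a > 0 for a in avail) / len(avail) < min_patient_frac:
            continue
        # Filter 2: average per-patient availability across instances.
        if mean(avail) < min_instance_frac:
            continue
        # Filter 3: drop covariates that are effectively constant
        # (the variance tolerance is an assumed definition).
        observed = [v for vals in by_patient.values()
                    for v in vals if v is not None]
        if pvariance(observed) <= min_variance:
            continue
        kept.append(name)
    return kept

# Hypothetical toy data: three patients, three instances (days) each.
data = {
    "heart_rate": {1: [80, 85, 90], 2: [70, None, 75], 3: [88, 92, 95]},
    "rare_test": {1: [1.2, None, None], 2: [0.9, None, None], 3: [1.1, None, None]},
    "constant": {1: [1, 1, 1], 2: [1, 1, 1], 3: [1, 1, 1]},
    "sparse": {1: [5, 6, 7], 2: [None, None, None], 3: [None, None, None]},
}
kept = filter_covariates(data)
```

In this toy example only heart_rate survives: rare_test fails the per-instance availability filter, constant fails the variability filter, and sparse is available for too few patients.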

Univariate Analysis

Given a binary outcome of interest, Y, the set of potential covariates was further reduced by selecting the most significant individual covariates. After ranking the covariates based on significance, I used a fixed significance threshold (e.g., p = 0.05) to keep the top covariates for inclusion in my initial multivariate model. My ranking was based on the Wald Z score of each covariate obtained from a univariate logistic regression model trained to predict Y.² The Wald statistic, Z, is defined as the coefficient estimate for the univariate model, $\hat{\beta}$, divided by the estimated standard error of $\hat{\beta}$,

\[ Z = \frac{\hat{\beta}}{\widehat{\mathrm{SE}}(\hat{\beta})}. \]
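As an illustration of the Wald statistic, the sketch below fits a univariate logistic regression with an intercept by Newton-Raphson and returns the slope estimate, its standard error, Z, and the p-value obtained by comparing the squared Z against the chi-square distribution with one degree of freedom. The function name univariate_wald and the toy data are hypothetical, and the sketch assumes the data are not perfectly separable; a statistics package would normally be used instead.

```python
import math

def univariate_wald(x, y, max_iter=50, tol=1e-10):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return (b1, se, z, p)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Gradient (g0, g1) and information matrix [[a, b], [b, c]].
        g0 = g1 = a = b = c = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            a += w
            b += w * xi
            c += w * xi * xi
        det = a * c - b * b
        # Newton step: inverse information matrix times gradient.
        d0 = (c * g0 - b * g1) / det
        d1 = (a * g1 - b * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Slope variance is the (2, 2) entry of the inverse information matrix,
    # taken from the last iteration (accurate once the steps are tiny).
    se = math.sqrt(a / det)
    z = b1 / se
    # P(chi-square with 1 d.f. > z**2) equals the two-sided normal tail.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return b1, se, z, p_value

# Hypothetical toy data with overlapping outcomes (not separable).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b1, se, z, p_value = univariate_wald(x, y)
```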

In addition to their original form, dataset covariates were evaluated for a variety of functional forms by applying transformations. The best form, in terms of the Wald statistic, was used for each variable. If the different transformations were nearly identical to the original form, then the original form was preferred. The transformations considered for each covariate included the following:

• Inverse (i)

• Absolute value (abs)

• Value squared (sq)

• Square root of value (sqrt)

• Logarithm of absolute value (la)

• Absolute deviation from mean (derangement) (am)

• Logarithm of absolute deviation from mean (lam)

² p-values are easily obtained by comparing the squared Wald Z statistic against the χ² distribution with one degree of freedom.

CHAPTER 4. METHODS: MODELING

While most values in the dataset were greater than or equal to 0, the absolute values in the above list were used for the few variables, such as Arterial Base Excess, that do drop below 0. To prevent logarithms of zero, values that were transformed with the logarithm were first shifted by adding a value of 0.0001.
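The candidate forms can be sketched as a table of functions keyed by the abbreviations above. Several details here are assumptions rather than statements about the original implementation: the square root is applied to the absolute value so that negative inputs are defined, the 0.0001 shift is added after taking the absolute value, the inverse is left undefined at zero, and "nearly identical" is interpreted as the Wald Z magnitudes differing by less than a relative tolerance tol. The names make_transforms and pick_form are hypothetical.

```python
import math

LOG_SHIFT = 0.0001  # shift added before taking logarithms, as in the text

def make_transforms(mean_value):
    """Return {form name: function} for one covariate, given its mean."""
    return {
        "orig": lambda v: v,
        "i": lambda v: 1.0 / v,  # undefined at v == 0; caller must handle
        "abs": lambda v: abs(v),
        "sq": lambda v: v * v,
        "sqrt": lambda v: math.sqrt(abs(v)),  # abs() is an assumption
        "la": lambda v: math.log(abs(v) + LOG_SHIFT),
        "am": lambda v: abs(v - mean_value),
        "lam": lambda v: math.log(abs(v - mean_value) + LOG_SHIFT),
    }

def pick_form(z_by_form, tol=0.05):
    """Pick the form with the largest |Z|, preferring "orig" when the gain
    over the original form is within a relative tolerance (assumed rule)."""
    best = max(z_by_form, key=lambda k: abs(z_by_form[k]))
    gain = abs(z_by_form[best]) - abs(z_by_form["orig"])
    if gain <= tol * abs(z_by_form[best]):
        return "orig"
    return best

transforms = make_transforms(mean_value=2.0)
shifted_log_of_zero = transforms["la"](0.0)   # log(0.0001), not an error
deviation = transforms["am"](-1.0)            # |-1 - 2| = 3
```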

The choice of the specific p-value threshold used for univariate screening warrants additional discussion. Many researchers have suggested using a rather liberal p-value such as 0.25 while others have been more conservative with lower p-values such as 0.05 [29]. The more stringent p-value thresholds avoid covariates of questionable importance, while a more liberal threshold admits covariates that may become important when considered along with other covariates. In general, the amount of data used for this research yields small p-values and most of the variables are significant at the 0.05 level.³

Collinearity Analysis

Using the top covariates (in their best form), I next screened the covariates to identify collinear or highly correlated covariates. This was done by first keeping only the best variable (based on univariate ranking) from each group of variables that were clearly correlated, such as the number of critical systolic blood pressure events over slightly different window lengths. After this first pass was completed, Spearman’s rank correlation, ρ, was used to create a large correlation matrix. Spearman’s rank correlation coefficient is a nonparametric measure of correlation that will detect monotonic relationships. Starting with the most significant univariate variables, correlation coefficients with other variables were examined. If a less significant variable had a ρ value greater than 0.8 with a more significant variable, the less significant variable was discarded.
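The Spearman screening step can be sketched in a few functions. This is an illustrative reading of the procedure, not the original code: covariates are visited from most to least significant, and one is kept only if its ρ with every covariate already kept does not exceed 0.8. The function names and toy columns are hypothetical, and constant columns are assumed to have been filtered out already.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def drop_correlated(ranked_names, columns, threshold=0.8):
    """Keep a covariate only if its rho with every kept covariate is at
    most the threshold; ranked_names is ordered most significant first."""
    kept = []
    for name in ranked_names:
        if all(spearman(columns[name], columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy columns; "sbp_alt" is a monotone function of "sbp".
columns = {
    "sbp": [100, 110, 120, 130, 140],
    "sbp_alt": [101, 112, 119, 135, 150],
    "hr": [80, 60, 95, 70, 85],
}
kept = drop_correlated(["sbp", "sbp_alt", "hr"], columns)
```

Because Spearman's ρ works on ranks, sbp_alt is perfectly correlated with sbp even though the relationship is not linear, so it is the one discarded.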

With the variables that remained after filtering, univariate ranking, and collinearity analysis, an initial multivariate model was fit to the data. First, however, the model fitting process typically required manual removal of variables that caused singularity problems. While the collinearity analysis removed strong pairwise correlations, in the context of several hundred covariates other more subtle correlations arose that prevented the β estimation process from converging. With these considerations, an initial multivariate model (typically containing several hundred variables) was trained.

³ For example, using the development split of the final dataset described in the previous chapter and logistic regression on mortality, a p-value threshold of 0.05 only eliminates around 15 of the 438 possible covariates.

Backward Elimination

With an initial model, backward elimination was next performed to simplify the model and remove variables (covariates) with marginal contribution. Backward elimination simplifies a large model by greedily removing the weakest variables. I used Akaike’s Information Criterion (AIC) to eliminate the weakest features. The AIC metric penalizes the log likelihood of a candidate model by subtracting the number of parameters that were estimated for the model. An alternative to AIC is the Bayesian Information Criterion (BIC). BIC places more emphasis on model parsimony by multiplying the AIC complexity penalty by $\frac{1}{2}\log(n)$, where n is the sample size used to train the model [24]. Backward elimination proceeded by iteratively eliminating the least significant variable until removing the least significant variable caused the AIC value of the model to surpass the typical AIC threshold of 0. When the AIC threshold of 0 was reached, no more variables were removed and the model fitted with the selected set of variables was retained.
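A generic version of greedy backward elimination on a penalized log likelihood can be sketched as below. It is not the exact procedure used here: it refits every reduced model through a caller-supplied loglik function rather than ranking variables by significance, and it interprets the threshold as the largest drop in the penalized log likelihood that a single removal may cause. With penalty=1.0 the criterion is the log likelihood minus the number of parameters, as described above, and penalty=0.5*log(n) gives the BIC variant. The toy loglik with additive gains is purely hypothetical.

```python
import math

def backward_eliminate(variables, loglik, penalty=1.0, threshold=0.0):
    """Greedy backward elimination on a penalized log likelihood.

    loglik(subset) must return the maximized log likelihood of the model
    fitted on that subset of variables.
    """
    current = list(variables)
    score = loglik(current) - penalty * len(current)
    while current:
        # Score every model obtained by dropping a single variable.
        candidates = []
        for v in current:
            subset = [u for u in current if u != v]
            candidates.append((loglik(subset) - penalty * len(subset), v))
        best_score, weakest = max(candidates)
        # Stop when even the cheapest removal costs more than the threshold.
        if score - best_score > threshold:
            break
        current.remove(weakest)
        score = best_score
    return current

# Hypothetical stand-in for model fitting: each variable adds a fixed
# amount to the log likelihood (real models would be refit each time).
gains = {"age": 12.0, "gcs": 8.0, "noise1": 0.3, "noise2": 0.6}

def loglik(subset):
    return -100.0 + sum(gains[v] for v in subset)

aic_model = backward_eliminate(list(gains), loglik)
bic_model = backward_eliminate(list(gains), loglik, penalty=0.5 * math.log(1000))
small_model = backward_eliminate(list(gains), loglik, threshold=7.5)
```

In the toy example the two noise variables add less to the log likelihood than the penalty they cost, so both criteria remove them; raising the threshold forces further removals.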

Sensitivity Analysis

By progressively increasing the AIC threshold from 0, I evaluated the sensitivity of the model to the number of covariates that it included. A plot of model performance versus the number of covariates provided a reasonable estimate of the asymptotic upper bound on performance and of the smallest number of covariates necessary to offer strong performance. In the course of the sensitivity analysis, if a simpler model was found that performed comparably to the more complex model, the complex model was discarded in favor of the simpler model.
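The final selection rule can be sketched as picking the smallest model whose performance is within a tolerance of the best one observed. The tolerance that defines "comparably", the function name simplest_comparable, and the numbers in the example are all made up for illustration; they are not results from this research. Any performance metric where higher is better can be used.

```python
def simplest_comparable(results, tolerance=0.005):
    """results: list of (n_covariates, performance), higher is better.

    Return the entry with the fewest covariates whose performance is
    within `tolerance` of the best observed performance.
    """
    best = max(perf for _, perf in results)
    eligible = [r for r in results if best - r[1] <= tolerance]
    return min(eligible)  # tuples sort by covariate count first

# Made-up (n_covariates, performance) pairs, one per threshold setting.
results = [(120, 0.885), (60, 0.884), (35, 0.882), (20, 0.870), (10, 0.840)]
chosen = simplest_comparable(results)
```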
