We developed a model that selected the optimal number of nearest neighbors based on how the nearest-neighbor policies performed on various target patients. Depending on the size of the model-building data set, we could either use all of it or select a representative sample based on criteria identified by a domain expert.
For our model we used a representative sample based on the magnitude and change of the targets’ eGFR. We built the model on the cases that were of primary interest to us, namely those with an average smoothed eGFR ∈ [10, 60], because such patients had CKD Stages 3-5 and, with a few exceptions, were not on dialysis. We calculated the percentage change in the smoothed eGFR values within the time window across which the target cases were defined. The changes were ordered, and a stratified sample of 2000 patients was obtained. Finally, we added the two patients with the highest and lowest percentage change in eGFR, for a total of 2002 patients. We then used the same stratification technique to select 500 of the 2002 patients as a validation set. The remaining 1502 patients were used as a training set.
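The sampling step above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the text does not specify the stratification rule, so the sketch assumes that the ordered changes are cut into n consecutive strata of near-equal size and that one patient is drawn at random from each. The function name, the `changes` dictionary, and the `seed` parameter are hypothetical.

```python
import random

def sample_with_extremes(changes, n, seed=0):
    """Select n + 2 patient ids spread over the ordered percentage changes.

    changes: dict mapping a patient id to that patient's percentage change
    in smoothed eGFR. The one-random-pick-per-stratum rule is an assumption;
    the source only states that the ordered changes were stratified.
    """
    ordered = sorted(changes, key=changes.get)   # ascending percentage change
    interior = ordered[1:-1]                     # extremes are added back below
    if n > len(interior):
        raise ValueError("not enough patients for the requested sample size")
    rng = random.Random(seed)
    picked = [ordered[0]]                        # lowest percentage change
    for k in range(n):
        lo = k * len(interior) // n              # stratum start (inclusive)
        hi = (k + 1) * len(interior) // n        # stratum end (exclusive)
        picked.append(interior[rng.randrange(lo, hi)])
    picked.append(ordered[-1])                   # highest percentage change
    return picked
```

Under these assumptions, calling the function with n = 2000 would return 2002 patient ids ordered by percentage change, with the two extreme patients always included.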
Each of the target patients was forecasted one month (F = 1) into the future. This time horizon allowed for a less subjective selection of accuracy measures. Additionally, the iterative nature of our forecasting approach requires good short-term predictions, because inaccurate month-1 forecasts could have a multiplicative effect on the inaccuracies in the forecasts that follow.
We recorded seven different 1-month forecasts for each target case using the procedure described in §2.3. The seven policies forecasted all target cases using (1) 1, (2) 2 equally weighted, (3) 2 unequally weighted, (4) 3 equally weighted, (5) 3 unequally weighted, (6) 4 equally weighted, or (7) 4 unequally weighted nearest neighbors (§2.3.4).
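To make the difference between the equally and unequally weighted policies concrete, the sketch below combines the values contributed by N neighbors into one forecast. It assumes that the combination is a weighted average, which this section does not confirm. The actual weighting scheme is described in §2.3.4 and is not reproduced here, so the weights are left as a caller-supplied argument. The function name is hypothetical.

```python
def combine_neighbor_forecasts(neighbor_values, weights=None):
    """Weighted average of the neighbors' values (an assumed combination rule).

    neighbor_values: one forecast value per nearest neighbor.
    weights: one non-negative weight per neighbor; equal weights if omitted.
    """
    n = len(neighbor_values)
    if weights is None:
        weights = [1.0] * n                      # equally weighted policy
    if len(weights) != n:
        raise ValueError("one weight per neighbor is required")
    return sum(w * v for w, v in zip(weights, neighbor_values)) / sum(weights)
```

For example, `combine_neighbor_forecasts([40.0, 44.0])` averages the two neighbors equally, while passing `weights=[3, 1]` pulls the result toward the first neighbor.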
Table 5: Nearest Neighbor’s Optimal Frequencies

               1-NN  2e-NN  2n-NN  3e-NN  3n-NN  4e-NN  4n-NN  Grand Total
Training Set    512    151    133    124    145    179    258         1502
Test Set        167     45     54     38     55     38    103          500
Once each target’s forecasts were recorded, they were compared to the actual smoothed time series of the target patient. We calculated the time series’ absolute percentage error, rounded off after the tenth decimal point, across the seven forecasted months when using (1) 1 (1-NN), (2) 2 equally weighted (2e-NN), (3) 2 non-equally weighted (2n-NN), (4) 3 equally weighted (3e-NN), (5) 3 non-equally weighted (3n-NN), (6) 4 equally weighted (4e-NN), or (7) 4 non-equally weighted (4n-NN) nearest neighbors. Since there were multiple time-series-based features, we focused on the one that was of primary importance for the model user. In this model, forecasting accuracy was evaluated in terms of the patients’ smoothed eGFR values, because eGFR is the best measure of kidney function [108].
We did not consider strategies that included information from more than four nearest neighbors because any “increase in the ratio of controls to cases lead to gain in power until a ratio of 4 to 1; after that point, gains in power usually become too small to be worthwhile” [29, 58]. That conclusion might explain the popularity of matching ratios between 1:1 and 1:4 [85].
We deemed a forecasting approach preferable if it had the smallest absolute percentage error on the primary feature, a measurement commonly used when comparing the accuracy of different forecasting methods. We recorded the preferable forecasting approach for each target case (Table 5).
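The selection rule can be illustrated with a short sketch. It computes the absolute percentage error of each policy's forecast against the actual smoothed eGFR, rounds it to ten decimal places, and returns the policy with the smallest error. Expressing the error on a 0-100 scale and breaking ties in favor of the policy listed first are assumptions, since the text does not state either; the function names are hypothetical.

```python
POLICIES = ("1-NN", "2e-NN", "2n-NN", "3e-NN", "3n-NN", "4e-NN", "4n-NN")

def absolute_percentage_error(actual, forecast):
    """Absolute percentage error, rounded after the tenth decimal place."""
    return round(abs((actual - forecast) / actual) * 100.0, 10)

def preferable_policy(actual, forecasts):
    """Return (best policy, per-policy errors) for one target case.

    forecasts: dict mapping each policy label to its forecasted value.
    Ties go to the policy listed first in POLICIES (an assumption).
    """
    errors = {p: absolute_percentage_error(actual, forecasts[p]) for p in POLICIES}
    best = min(POLICIES, key=errors.get)
    return best, errors
```

Counting the returned labels over all target cases would give frequencies of the kind reported in Table 5.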
We used two hundred and forty-five predictors to determine the preferable forecasting technique for each patient. All predictors were based either on the closeness of the target and solution cases when one through four nearest neighbors were used, or on previously discussed case-base features, which fell into four categories: (1) smoothed and (2) raw laboratory values, (3) appointment history, and (4) comorbidities. The predictors based on smoothed lab results included (1) the change and (2) the squared change between months 1 and 12, (3) the number of peaks, (4) the number of troughs, (5) the average, (6) the squared average, (7) the standard deviation, (8) the variance, (9) the slope, and (10) the squared slope of the smoothed eGFR. The predictors based on raw lab results included (1) the minima, (2) the squared minima, (3) the maxima, (4) the squared maxima, (5) the standard deviations, and (6) the variances of each patient’s raw eGFR, albumin, phosphate, potassium, systolic and diastolic blood pressure, and weight.
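As an illustration of the ten predictors derived from the smoothed eGFR, the sketch below computes them for one twelve-month series. Several definitions are assumptions, because the text does not fix them: the change is taken as the plain difference between the last and first month, peaks and troughs are strict local maxima and minima, the standard deviation is the sample standard deviation, and the slope is the least-squares slope against the month index. All names are hypothetical.

```python
from statistics import mean, stdev

def count_turning_points(series):
    """Count strict local maxima (peaks) and minima (troughs)."""
    peaks = troughs = 0
    for prev, cur, nxt in zip(series, series[1:], series[2:]):
        if cur > prev and cur > nxt:
            peaks += 1
        elif cur < prev and cur < nxt:
            troughs += 1
    return peaks, troughs

def least_squares_slope(series):
    """Slope of the least-squares line through (month index, value)."""
    n = len(series)
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(series))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den

def smoothed_egfr_predictors(series):
    """Ten predictors for one smoothed eGFR series (definitions assumed)."""
    change = series[-1] - series[0]
    peaks, troughs = count_turning_points(series)
    avg = mean(series)
    sd = stdev(series)
    slope = least_squares_slope(series)
    return {
        "change": change, "change_sq": change ** 2,
        "peaks": peaks, "troughs": troughs,
        "average": avg, "average_sq": avg ** 2,
        "sd": sd, "variance": sd ** 2,
        "slope": slope, "slope_sq": slope ** 2,
    }
```

The raw laboratory predictors (minima, maxima, standard deviations, variances, and their squares) could be computed in the same style, one dictionary per laboratory measure.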
The appointment-based predictors included (1) the number and (2) the squared number of relevant outpatient appointments attended, (3) the number and (4) the squared number of days the target patient was hospitalized in the twelve-month time window across which the target was defined.
The closeness measures covered all features used in the retrieval procedure described in §2.3.3: (1) the smoothed eGFR time series, (2) the smoothed eGFR time series’ peaks and troughs, (3) albumin, (4) potassium, (5) phosphate, (6) weight, (7) systolic blood pressure, and (8) diastolic blood pressure. For each feature, we used the averaged dissimilarity and the squared averaged dissimilarity obtained when a patient is forecasted using one, two, three, and four nearest neighbors. Additionally, for each of the eight features, we used the difference and squared difference between dissimilarities in six comparisons: the dissimilarity of the single nearest neighbor against the averaged dissimilarity of the two, three, or four nearest neighbors; the averaged dissimilarity of the two nearest neighbors against that of the three or four nearest neighbors; and the averaged dissimilarity of the three nearest neighbors against that of the four nearest neighbors. The difference metrics were standardized so that their estimated coefficients would measure their relative importance.
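The closeness predictors for a single feature can be sketched as follows. Given the dissimilarities between a target and its four nearest neighbors, the sketch averages them over the first k neighbors, forms the six pairwise differences and their squares, and standardizes a column of values across patients. The sign convention of the differences (fewer neighbors minus more neighbors) and the use of z-scores with the population standard deviation are assumptions; the names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

def averaged_dissimilarities(neighbor_dissims):
    """Map k = 1..4 to the mean dissimilarity of the k nearest neighbors.

    neighbor_dissims: four dissimilarities, nearest neighbor first.
    """
    return {k: mean(neighbor_dissims[:k]) for k in range(1, 5)}

def pairwise_differences(avg):
    """Differences and squared differences for the six (a, b) comparisons."""
    out = {}
    for a, b in combinations(range(1, 5), 2):   # (1,2), (1,3), ..., (3,4)
        diff = avg[a] - avg[b]                  # sign convention is assumed
        out[f"diff_{a}v{b}"] = diff
        out[f"sqdiff_{a}v{b}"] = diff ** 2
    return out

def standardize(values):
    """Z-scores of one predictor column across patients."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]            # constant column
    return [(v - mu) / sigma for v in values]
```

Running the first two functions once per feature would yield, for each of the eight features, four averaged dissimilarities and twelve difference-based predictors before squaring the averages.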
The five binary comorbidity characteristics indicated whether the target patient had been diagnosed with (1) diabetes, (2) heart failure, (3) PVD/CVD, or (4) cirrhosis, or (5) was on dialysis. Based on the five binary characteristics, patients fell into one of thirty-two classes (2^5 = 32). Twenty-five of the thirty-two classes had nonzero frequencies in the training set and were therefore used as possible predictors. The classes ranged from patients without comorbidities and not on dialysis (Class 1) to individuals on dialysis with all four comorbidities (Class 32). The last comorbidity-based variable considered was the number of comorbidities each patient had.
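One way to map the five binary characteristics to the thirty-two classes is to read them as the bits of a number and add one, which reproduces the two endpoints given above (Class 1 for no flags, Class 32 for all five). The numbering of the intermediate classes depends on the bit order, which the text does not specify, so the order below is an assumption, as is counting only the four diagnoses (not dialysis) as comorbidities. The field names are hypothetical.

```python
FLAGS = ("diabetes", "heart_failure", "pvd_cvd", "cirrhosis", "dialysis")

def comorbidity_class(patient):
    """Class number in 1..32 from five binary flags (bit order assumed)."""
    code = 0
    for position, flag in enumerate(FLAGS):
        if patient[flag]:
            code |= 1 << position
    return code + 1

def comorbidity_count(patient):
    """Number of diagnosed comorbidities, excluding dialysis (assumed)."""
    return sum(bool(patient[flag]) for flag in FLAGS[:4])
```

Each class observed in the training set could then be turned into its own indicator predictor.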
The methods we selected to establish a relationship between the preferable forecasting method for a target case and its descriptors in both steps of the model included logistic regression, discriminant analysis, C5.0, CART, and Support Vector Machines (SVM). We used a CART model with a Gini impurity measure for categorical targets, a discriminant analysis model with stepwise parameter selection, and logistic regression models with either forward or backward conditional parameter selection. We considered SVM models with second- and third-degree polynomial kernels and with an RBF kernel. To account for possible overfitting, we used regularization parameters C = 1, ..., 10 for each kernel type. We fitted five C5.0 models with different settings for the minimum number of records per child branch (M = 2, 5, 10, 15, 20).
The model we selected had two steps. First, we classified patients as having a preference for the 1-NN, the equally weighted (2e-, 3e-, 4e-NN), or the non-equally weighted (2n-, 3n-, 4n-NN) multiple-NN forecasting approaches. The second step required two different models: one for cases predicted as having equally weighted nearest neighbor preferences, and a second for patients classified as having non-equally weighted nearest neighbor preferences. Each of the second-step models had three classification alternatives, namely to use (1) two, (2) three, or (3) four nearest neighbors.
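The routing logic of the two-step model can be sketched independently of the classifier family. In the sketch, the three fitted classifiers are represented by plain callables, which are stand-ins for whichever of the candidate methods is chosen at each step; the group labels and function names are hypothetical.

```python
def policy_group(policy):
    """Map one of the seven policy labels to its step-one class."""
    if policy == "1-NN":
        return "1-NN"
    return "equal" if policy[1] == "e" else "non-equal"

def two_step_predict(x, step_one, step_two_equal, step_two_unequal):
    """Predict the preferred policy for one patient's predictor vector x.

    step_one(x) returns "1-NN", "equal", or "non-equal".
    step_two_equal(x) and step_two_unequal(x) return 2, 3, or 4.
    """
    group = step_one(x)
    if group == "1-NN":
        return "1-NN"                       # no second step needed
    if group == "equal":
        return f"{step_two_equal(x)}e-NN"   # equally weighted branch
    return f"{step_two_unequal(x)}n-NN"     # non-equally weighted branch
```

`policy_group` shows how the step-one training labels could be derived from the recorded preferable policies, while each second-step model would be trained only on the cases belonging to its branch.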