INTRODUCTION
Case classification is a major step in developing a PPS payment formula. In this section, we describe the construction of a set of Functional Independence Measure-Function Related Groups (FIM-FRGs, or simply FRGs). FRGs partition the population into groups that are medically similar and that have similar expected resource needs.
Facilities will ultimately be compensated for typical cases (i.e., cases that are discharged to the community after a full course of rehabilitation) according to a formula that depends primarily on their assigned class, adjusted by comorbidities, area wage rates, and other hospital characteristics. Here, we classify only typical cases, which we define more precisely in Section 6. (In Section 5, we discuss payment rules for unusual cases, including interrupted stays.) The discharge is our unit of classification: For typical cases, a case and a discharge are the same.
We built on the FRG classification methodology developed in Stineman et al. (1994), extended in Carter et al. (1997), and further extended in Carter et al. (2000). Each source applies classification and regression trees (CART) to develop a set of FRGs tailored to the latest data and incorporating new refinements. Our 1997 report developed FRGs from 1994 data using the 20 rehabilitation impairment categories (RICs) proposed by Stineman. It also verified that the original Stineman FRGs were stable, effective predictors of resource use. The 2000 report (referred to as the “interim” report) refit FRGs from 1996 and 1997 data, confirmed the predictive power of FRGs on other years’ data, and explored the utility of reformulating the RICs. This section uses four years of data (1996 through 1999), reconfirms the predictive power of the CART-based FRGs, and evaluates the quality of FRGs based on their performance relative to a “gold standard” model.
Our final classification system is based on the structure of the FRGs enhanced with information about comorbidities (Section 4). FRG development precedes inclusion of comorbidity information and requires two steps:
1. Grouping cases that are clinically similar. Here we started with the 21 RICs developed in the interim report: Stineman’s 20 RICs plus another RIC for burns. We recommended adding the burn RIC after we had examined a variety of changes that Dr. Stineman suggested might improve either clinical or resource homogeneity.
2. Grouping cases that have similar resource needs. Within RICs, we used the statistical method CART (Breiman et al., 1984) to partition the population of cases into groups that were homogeneous with respect to resource use and functional impairment.
We begin with a brief review of the classification results from the interim report. Then we provide a description of the data, a subsection on modeling methods and results, and finally a subsection on obtaining new FRGs.
REVIEW OF PREVIOUS CLASSIFICATION SYSTEMS
In the project’s interim report, we developed a set of FRGs based on 1996 and 1997 data. These FRGs were meant to support development of the payment formula used in the NPRM and to invite feedback and criticism. We intended to update them based on comments received and on the arrival of 1998 and 1999 data.
Developing this classification system entailed experimenting with alternative ways to define RICs and developing FRGs as predictive models of resource use within the newly defined RICs. We briefly review our findings below.
Rehabilitation Impairment Categories
FIM data contain an “impairment” code that gives the reason for the rehabilitation stay. Stineman et al. (1994) mapped these codes into
rehabilitation impairment categories (RICs). The 1997 study by Carter et al. convened a panel of rehabilitation experts, who generally approved the RIC definitions but offered some suggestions to try if sample sizes became larger. Further, Dr. Stineman wished to explore some modifications to the definition of RIC in a larger sample.
These partitions were reconsidered in the interim report. Because more data were available than had been when the earlier set of RICs was generated, we felt that we should examine different alternatives. We redefined RICs based on
clinician judgment of the clinical homogeneity of the patients, backed up by analyses of resource costs.
We then examined new RICs that split, combined, or rearranged existing RIC groupings. The criterion for whether an additional grouping would be desirable was whether it would lead to more accurate predictions. We evaluated this by fitting cost prediction models within alternative candidate RICs, then seeing if
there was improvement in the mean cost predictions accompanied by a drop in root mean square error. In most cases, we saw very little change in performance; so in view of the already broad acceptance in the rehabilitation community of the existing RICs, we chose to leave those RICs alone. There were also several cases where the new groupings performed substantially worse than the older grouping.
Of the twelve alternative groupings we tried, there were two areas where positive changes seemed large enough to be important, and one case in which a cosmetic change seemed warranted:
1. We defined a new RIC for burns and eliminated burn cases from the “miscellaneous” RIC.
2. We moved the group “status post major multiple fractures” from the
“orthopedic--lower extremity fracture” RIC to group it with other cases in the “major multiple trauma, no brain or spinal cord injury” RIC.
3. Switching RIC 20 (major multiple trauma with injury to brain or spinal cord) and RIC 18 (miscellaneous) enabled us to put the major multiple trauma RICs together in numerical sequence.
Our final definition of impairment codes is shown in Table 3.1.
Function Related Groups (FRGs)
We used the statistical method CART, as described in Breiman et al. (1984), to partition the population of cases within RICs into groups that appeared homogeneous with respect to resource use and functional impairment. CART was also used in Stineman’s initial development of the FRGs and in our 1997 evaluation of the Stineman methods. The CART algorithm examines a set of independent variables and searches for a partition that explains variation in the dependent variable. The algorithm is recursive. At each step, CART examines all possible two-way splits of the existing groups and chooses the split that offers the greatest increase in R-squared for that step. CART stops splitting when further splits appear to capture only noise.
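The split search described above can be illustrated with a minimal sketch. It is not the CART software used in the study: it handles a single predictor and a single step, and the function names (`sse`, `best_split`) and the six example cases are hypothetical. It simply evaluates every two-way split of one group and keeps the one with the largest gain in R-squared.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(scores, costs, min_cases=1):
    """Return (threshold, r2_gain) for the best split 'score <= threshold'.

    r2_gain is the reduction in squared error divided by the group's total
    squared error, i.e., the increase in R-squared from this one split.
    """
    total = sse(costs)
    if total == 0:
        return (None, 0.0)  # nothing left to explain
    best = (None, 0.0)
    for threshold in sorted(set(scores))[:-1]:  # the largest value cannot split
        left = [c for s, c in zip(scores, costs) if s <= threshold]
        right = [c for s, c in zip(scores, costs) if s > threshold]
        if len(left) < min_cases or len(right) < min_cases:
            continue
        gain = (total - sse(left) - sse(right)) / total
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical motor scores and costs for six cases (illustration only).
motor = [20, 25, 30, 50, 55, 60]
cost = [15000, 14000, 16000, 8000, 9000, 7000]
threshold, gain = best_split(motor, cost)
print(threshold, round(gain, 3))
```

A full tree would apply the same search recursively to each resulting group and to every candidate variable. The `min_cases` argument corresponds to a minimum group size.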
Table 3.1
Final Grouping of Impairment Group Codes into Rehabilitation Impairment Categories
Rehabilitation Impairment Category                            Impairment Groups
 1  Stroke                                                    1.1 through 1.9
 2  Traumatic brain injury                                    2.2, 2.21, 2.22
 3  Nontraumatic brain injury                                 2.1, 2.9
 4  Traumatic spinal cord                                     4.2, 4.21 through 4.23
 5  Nontraumatic spinal cord                                  4.1, 4.11 through 4.13
 6  Neurological                                              3.1, 3.2, 3.3, 3.5, 3.8, 3.9
 7  Hip fracture                                              8.11 through 8.3
 8  Replacement of lower extremity joint                      8.51 through 8.72
 9  Other orthopedic                                          8.9
10  Amputation, lower extremity                               5.3 through 5.7
11  Amputation, other                                         5.1, 5.2, 5.9
12  Osteoarthritis                                            6.2
13  Rheumatoid, other arthritis                               6.1, 6.9
14  Cardiac                                                   9
15  Pulmonary                                                 10.1, 10.9
16  Pain syndrome                                             7.1 through 7.9
17  Major multiple trauma, no brain or spinal cord injury     8.4, 14.9
18  Major multiple trauma, with brain or spinal cord injury   14.1, 14.2, 14.3
19  Guillain-Barré syndrome                                   3.4
20  Miscellaneous                                             12.1, 12.9, 13, 15, 16, 17 through 17.9
21  Burns                                                     11
In the interim report, we used CART’s “tenfold cross-validation method” to determine the optimal number of splits in the final classification tree. This method divides the data into ten mutually exclusive sets of equal size, chosen at random. For each set, a tree with k nodes is fit on the other 90 percent of the data, and the squared error of its predictions for the held-out 10 percent is computed; these squared errors are then summed over the ten sets. CART then chooses the k with the minimum sum of squared errors (equivalently, the maximum R-squared) and fits a tree on the entire data set with k nodes.
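The cross-validation procedure can be sketched as follows. This is an illustration under stated assumptions, not the CART implementation: `fit_step_model` is a hypothetical stand-in for a k-node tree (it bins one predictor into about k equal-count groups and predicts each group's mean), and the data are made up.

```python
import random


def fit_step_model(train, k):
    """Stand-in for a k-node tree: about k equal-count bins on x, each predicting its mean y."""
    points = sorted(train)
    size = max(1, len(points) // k)
    bins = [points[i:i + size] for i in range(0, len(points), size)]
    cuts = [(b[-1][0], sum(y for _, y in b) / len(b)) for b in bins]

    def predict(x):
        for upper, mean in cuts:
            if x <= upper:
                return mean
        return cuts[-1][1]  # beyond the last bin: use the last bin's mean

    return predict


def choose_k(data, k_values, folds=10, seed=0):
    """Pick the model size with the smallest cross-validated sum of squared errors."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    parts = [rows[i::folds] for i in range(folds)]  # ten disjoint random sets
    cv_sse = {}
    for k in k_values:
        total = 0.0
        for i, held_out in enumerate(parts):
            train = [r for j, part in enumerate(parts) if j != i for r in part]
            predict = fit_step_model(train, k)
            total += sum((y - predict(x)) ** 2 for x, y in held_out)
        cv_sse[k] = total
    best_k = min(cv_sse, key=cv_sse.get)
    return best_k, cv_sse


# Hypothetical data: cost drops from 20 to 10 once the score reaches 50.
data = [(x, 20.0 if x < 50 else 10.0) for x in range(100)]
best_k, cv_sse = choose_k(data, k_values=[1, 2, 4, 8])
final_model = fit_step_model(data, best_k)  # refit on the entire data set
```

The final line mirrors the last step in the text: once k is chosen, the model is refit on all of the data.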
Because this method resulted in many splits in large RICs, we traced the values of R-squared as we increased the number of nodes within each RIC. These traces show that the gain in R-squared per node is rather low for those trees exceeding 10 splits. Furthermore, as the size of the tree approaches 20 nodes, the R-squared is often very close to the CART maximum. We used this fact to justify restricting our consideration to models with fewer than 20 nodes in all cases.
Using all our 1997 data, we fit CART to data sets that were at times very large. We required a minimum of 100 cases in each FRG. Raw CART fits produced a total of 359 nodes. This is not too surprising--CART adds nodes as long as the increase in R-squared seems statistically significant. With large samples, even minor differences are statistically significant.
For administrative simplicity, we did not want to create such a large number of groups. In addition, we did not want to create groups characterized by very small intervals of motor or cognitive scales for fear it would encourage upcoding. Because it would not be enough to simply use what CART considered the “best” model, our strategy was to produce some reduced-size models according to their perceived statistical power and practical importance. To accomplish this, we employed four steps:
1. We looked at the R-squared trace produced by CART, which confirmed that there is little to gain in going beyond 20 nodes per RIC. This yielded 232 nodes in total. We therefore bounded the number of nodes at 20 and considered further reductions with respect to these bounded models.
2. We used a stopping rule based on one prediction standard error (a rule recommended by Breiman et al., 1984). This reduced the total number of nodes to 143.
3. We tried stopping when R-squared was within .01 of the R-squared for the one-standard-error models. For computational reasons, we used the actual R-squareds, not the cross-validated ones shown in the table. Results were only minimally affected. This reduced the total number of nodes to 104.
4. We looked at the FRG category definitions from the 104-node model and noticed that predicted costs were sometimes quite close (within $1,500) among lower branches of the same part of the tree. We combined these categories, thereby reducing our tree size to 92 nodes.
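Steps 2 and 3 above can be expressed as two small selection rules. The sketch below is an interpretation rather than the procedure actually run: the function names are hypothetical, and the error, standard-error, and R-squared values are invented for a single RIC.

```python
def one_se_size(cv_error, cv_se):
    """Smallest tree size whose cross-validated error is within one standard
    error of the minimum (the rule recommended by Breiman et al., 1984)."""
    best = min(cv_error, key=cv_error.get)
    limit = cv_error[best] + cv_se[best]
    return min(k for k in cv_error if cv_error[k] <= limit)


def within_tolerance_size(r2_by_size, reference_size, tol=0.01):
    """Smallest tree size whose R-squared is within tol of the reference size's."""
    target = r2_by_size[reference_size] - tol
    return min(k for k in r2_by_size if r2_by_size[k] >= target)


# Hypothetical values for one RIC, keyed by number of nodes (illustration only).
cv_error = {2: 0.80, 4: 0.70, 6: 0.66, 8: 0.65, 10: 0.649}
cv_se = {k: 0.012 for k in cv_error}
r2 = {2: 0.20, 4: 0.332, 6: 0.34, 8: 0.35, 10: 0.351}

step2 = one_se_size(cv_error, cv_se)         # one-standard-error rule
step3 = within_tolerance_size(r2, step2)     # within .01 of the step 2 R-squared
print(step2, step3)
```

In this made-up example the one-standard-error rule selects a smaller tree than the error minimum, and the .01 tolerance shrinks it once more.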
We examined the progressive reduction in the number of nodes as a result of each of these steps. We selected two candidate models for further evaluation: a 143-node model arising from CART’s one-standard-error rule, and a 92-node reduced model arising from requiring little gain in the R-squared trace and from combining (in pairs or triples) adjacent branches of the classification tree wherever predicted costs were close. We used simulated payments to assess prediction bias for various combinations of demographic and hospital factors. We found that using the 143-node model rather than the 92-node model had only a very small effect on overall accuracy and no noticeable effect on payment for any group of hospitals. Thus, we recommended the 92-node model to HCFA, and we went forward with the 92-node model to develop other aspects of our
proposed payment plan (e.g., wage adjustments, outlier payments). The
definitions of the groups can be found either in our interim report or in the NPRM.
DATA
For the present study, we used the merged MEDPAR/FIM discharge file for calendar years 1996 through 1999, discussed in Section 2. Construction of the file itself is described in Relles and Carter (2002).
Data Set Contents
The merged MEDPAR/FIM data contained several variables we needed for modeling and classification. Table 3.2 identifies these variables and indicates at which stages of the process they were used.
The selection variables define what we think of as the typical case. We exclude transfers to hospitals and nursing homes, deaths, cases of three days or less duration, and statistical outliers. Also, the clinical partitioning and resource use variables needed to be present and in range. Selection was based on the intersection of the rules shown in Table 3.3.
Table 3.4 shows the amount of data we had to work with, before and after selection. Most of the reduction in cases is for ineligibility: deaths,
interrupted stays, or transfers. The last column indicates how many cases were kept with full information. Overall, the reductions due to missing cost data and data quality (present and in-range, excluding statistical cost outliers) are small: about 3 percent in 1996, 4 percent in 1997, 2 percent in 1998, and 3 percent in 1999. Fortunately, the additional reduction due to outliers is
especially small, less than 0.3 percent everywhere, so we do not believe we are contaminating our results by the outlier exclusions.
Table 3.2
MEDPAR/FIM Variables and Stages of Use
Purpose                 Variable    Source   Description
Selection               AGE         MEDPAR   Age
                        DISSTAY     FIM      Discharge stay indicator
                        LOS         MEDPAR   Length of stay
                        IMPCD       FIM      Rehabilitation impairment codes
                        PROVCODE    MEDPAR   Provider code
                        PROVNO      MEDPAR   Provider number
                        TCOST       MEDPAR   Total cost estimates, based on cost-to-charge ratios, adjusted by area wage index (a)
Clinical partitioning   IMPCD       FIM      Impairment code
                        RIC         FIM      Clinical groupings resulting from impairment code mappings
Resource use            TCOST       MEDPAR   Total cost estimates, based on cost-to-charge ratios, adjusted by area wage index (a)
                        COGNITIVE   FIM      Cognitive scores (b): comprehension; expression; social interaction; problem solving; memory
                        MOTOR       FIM      Motor scores (b): eating; grooming; bathing; dressing--upper body; dressing--lower body; toileting; bladder management; bowel management; bed, chair, wheelchair transfer; toilet transfer; tub or shower transfer; walking or wheelchair; stair ascending and descending
                        AGE         MEDPAR   Age
(a) These methods are described in Section 2.
(b) These individual components are organized into various types of indices, according to body areas and types of impairment. See Table 3.6.
Table 3.3
Rules for Selection of Modeling Cases
Variable                   Selection Requirement
AGE                        Between 16 and 105
DISSTAY                    Indicates discharged to the community
LOS                        More than three days, less than one year
IMPCD, TCOST               We excluded cases with wage-adjusted log-cost more than three standard deviations from its average within RIC
IMPCD                      Contained in impairment list for assignment to rehabilitation categories (see Table 3.1)
TCOST, COGNITIVE, MOTOR    Greater than zero
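Because selection is the intersection of the rules in Table 3.3, it can be written as a single predicate. The sketch below is illustrative only. The field name `DISCHARGED_TO_COMMUNITY` is a hypothetical boolean standing in for the DISSTAY coding, the age bounds are assumed to be inclusive, "one year" is taken as 365 days, and the within-RIC mean and standard deviation of log-cost are assumed to be supplied by the caller.

```python
import math


def passes_selection(case, valid_impairment_codes, log_cost_mean, log_cost_sd):
    """Apply all Table 3.3 rules to one case; every rule must hold."""
    if not (16 <= case["AGE"] <= 105):            # assumed inclusive bounds
        return False
    if not case["DISCHARGED_TO_COMMUNITY"]:       # hypothetical recode of DISSTAY
        return False
    if not (3 < case["LOS"] < 365):               # more than 3 days, under a year
        return False
    if case["IMPCD"] not in valid_impairment_codes:
        return False
    if min(case["TCOST"], case["COGNITIVE"], case["MOTOR"]) <= 0:
        return False
    # Outlier rule: log-cost within three standard deviations of the RIC average.
    z = (math.log(case["TCOST"]) - log_cost_mean) / log_cost_sd
    return abs(z) <= 3


# Hypothetical case and RIC statistics (illustration only).
typical = {"AGE": 78, "DISCHARGED_TO_COMMUNITY": True, "LOS": 18,
           "IMPCD": "1.1", "TCOST": 12000.0, "COGNITIVE": 25, "MOTOR": 40}
outlier = dict(typical, TCOST=100000.0)
stroke_codes = {"1.1", "1.2", "1.3", "1.4", "1.9"}

print(passes_selection(typical, stroke_codes, log_cost_mean=9.3, log_cost_sd=0.5))
print(passes_selection(outlier, stroke_codes, log_cost_mean=9.3, log_cost_sd=0.5))
```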
Table 3.4
Number of Linked MEDPAR/FIM Records
Calendar Year   Matched Records   Present and in-Range   Eligible   Excluding Outliers
1996            171,626           166,645                126,900    126,581
1997            206,032           197,076                148,526    148,142
1998            232,691           228,248                170,266    169,816
1999            257,024           249,941                187,257    186,766
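The percentages quoted in the text can be checked against Table 3.4. The short calculation below assumes that the data-quality reduction is measured from matched records to present-and-in-range records, and that the outlier reduction is measured relative to eligible cases. Both definitions are our reading of the table, not something the text states explicitly.

```python
# Counts from Table 3.4: (matched, present and in-range, eligible, excluding outliers).
counts = {
    1996: (171626, 166645, 126900, 126581),
    1997: (206032, 197076, 148526, 148142),
    1998: (232691, 228248, 170266, 169816),
    1999: (257024, 249941, 187257, 186766),
}

quality_loss = {}   # percent of matched records lost to missing or out-of-range data
outlier_loss = {}   # percent of eligible cases lost to the outlier rule
for year, (matched, in_range, eligible, kept) in counts.items():
    quality_loss[year] = 100 * (matched - in_range) / matched
    outlier_loss[year] = 100 * (eligible - kept) / eligible
    print(year, round(quality_loss[year], 2), round(outlier_loss[year], 2))
```

Under these assumptions the data-quality reductions round to 3, 4, 2, and 3 percent, and every outlier reduction is below 0.3 percent, consistent with the figures in the text.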
Our numbers of 1996 and 1997 cases are slightly reduced from the numbers shown in Table 3.2 of the interim report, for two reasons. First, in 1996 and 1997 we were working only with the standard motor and cognitive indices and had imputed their values from partial information, if available. Here, because we needed to work with individual components and several alternative subscales, we eliminated all cases that were not complete on all components. This reduced our 1996 counts by about 300 cases and our 1997 counts by about 200. Second, we had allowed discharges to some subgroups of nursing homes in our 1996 and 1997 models. HCFA subsequently decided to classify such cases as transfers, so we adjusted our 1996 and 1997 data sets to exclude them, subtracting about 800 cases in each year.
Case Stratification and Sample Sizes
Previous work had established 21 clinical groupings of patients according to rehabilitation impairment codes within which we would be fitting models. Table 3.5 describes those groupings and the sample sizes available for the modeling effort according to the selection rules in Table 3.3.
Table 3.5
RIC Definitions and Sample Sizes
Rehabilitation Impairment Category                                    1996      1997      1998      1999
 1  Stroke                                                          32,687    35,026    37,012    37,340
 2  Traumatic brain injury                                           1,383     1,629     1,871     2,053
 3  Nontraumatic brain injury                                        2,517     2,863     3,402     3,758
 4  Traumatic spinal cord                                              738       810       930       953
 5  Nontraumatic spinal cord                                         3,782     4,340     5,295     5,837
 6  Neurological                                                     4,730     5,717     7,832     8,875
 7  Hip fracture                                                    16,017    17,167    18,774    20,627
 8  Replacement of lower extremity joint                            31,151    37,383    40,931    43,427
 9  Other orthopedic                                                 5,292     6,547     8,022     9,310
10  Amputation, lower extremity                                      4,810     5,423     5,930     6,156
11  Amputation, other                                                  354       477       542       662
12  Osteoarthritis                                                   2,340     2,854     3,983     5,036
13  Rheumatoid, other arthritis                                      1,169     1,521     1,944     2,350
14  Cardiac                                                          4,097     5,662     6,885     8,104
15  Pulmonary                                                        2,442     3,561     4,340     5,382
16  Pain syndrome                                                    1,321     1,873     2,529     2,993
17  Major multiple trauma (MMT), no brain or spinal cord injury      1,188     1,288     1,540     1,679
18  MMT, with brain or spinal cord injury                              156       222       221       256
19  Guillain-Barré syndrome                                            240       278       299       313
20  Miscellaneous                                                   10,097    13,398    17,423    21,553
21  Burns                                                               70       103       111       102
    Total                                                          126,581   148,142   169,816   186,766
MODELING METHODS AND RESULTS
A meeting of the project’s technical expert panel was convened in May 2000 to review a draft of our interim report. During this meeting we discussed the methods and results of our initial FRG fits. We took from that meeting a set of three basic suggestions for improving on what we had done:
• Explore alternative model forms. Develop models to compete with CART in terms of having strong predictive performance.
• Consider indices of function in addition to the cognitive and motor scores. Payment formulas based on these measures might offer better estimates of cost.
• Evaluate out-of-sample performance of the models. An important element of a payment system is whether payment formulas offer accurate prospective estimates of cost.
This subsection describes how we implemented these suggestions and how we used the results to decide which methods to use in developing a set of recommended FRGs. The description and evaluation of our recommended FRGs are deferred to a later subsection.
Suggestions of the Technical Expert Panel
Explore Alternative Model Forms
We expected that classification and regression trees would form the final determination of the FRGs. According to the Balanced Budget Act of 1997, the rehabilitation PPS system is to be based on discharges classified according to FRGs based on impairment, age, comorbidities, and functional capability of the patient, as well as other factors deemed appropriate to improve the explanatory power of FRGs. CART is the traditional method of generating FRGs (Stineman et al., 1997b) and a reasonable method of determining rules to classify patients into groups that explain cost. Various algorithms have been proposed to build tree-structured regression models, all of which tend to be minor variations on CART. CART is efficient at producing simple and effective rules for prediction but also has its limitations. We discuss CART’s strengths and limitations in the next subsection.
Even after computing an unbiased estimate of the predictive performance of a particular regression tree, we found it difficult to judge how much better we might have done if we were not subject to CART’s limitations. We know that R-squared ought to be between 0.0 and 1.0, with the highest values indicative of nearly perfect prediction. But when R-squared is potentially much lower than 1.0, we need a way to judge whether CART has performed as well as could be expected. To investigate this further, we compared CART’s performance with that of other methods.
We compared CART to ordinary linear least squares regression models, generalized additive models (GAM), and multiple additive regression trees (MART). The first of these three methods is classic, the second is relatively new, and the last is the latest in prediction methodology. All these models are discussed in the statistical literature. We used the version of GAM (Hastie and Tibshirani, 1990) implemented in the statistical package S-plus. MART is described in Hastie, Tibshirani, and Friedman (2001); we used software developed by one of its co-authors.
We assessed each model’s predictive performance on preceding and subsequent years. That is, we fit each model (CART, linear regression, GAM, and MART) to one year’s data (1997, for example), and used that model to predict cost for the other years (1996, 1998, and 1999, in this case).
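The cross-year assessment can be sketched as a loop over fitting years and prediction years. The example below is a minimal illustration, not the study's code: `fit_line` is a hypothetical one-variable least squares fit standing in for any of the four model types, and the yearly data are invented.

```python
def r_squared(actual, predicted):
    """Share of variance in actual values explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_total = sum((a - mean) ** 2 for a in actual)
    ss_resid = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_resid / ss_total


def fit_line(xs, ys):
    """Stand-in model: one-variable least squares, returned as a predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cross_year_r2(data_by_year, fit):
    """Fit on each year in turn and score the predictions on every other year."""
    results = {}
    for fit_year, (xs, ys) in data_by_year.items():
        predict = fit(xs, ys)
        for test_year, (tx, ty) in data_by_year.items():
            if test_year != fit_year:
                results[(fit_year, test_year)] = r_squared(ty, [predict(x) for x in tx])
    return results


# Hypothetical (motor score, cost) data for three years (illustration only).
data_by_year = {
    1996: ([20, 40, 60], [16000, 12000, 8000]),
    1997: ([30, 50, 70], [14500, 9500, 6500]),
    1998: ([25, 45, 65], [15000, 11000, 7000]),
}
results = cross_year_r2(data_by_year, fit_line)
for pair in sorted(results):
    print(pair, round(results[pair], 3))
```

Because each score is computed on data not used in fitting, a model that merely overfits its own year would show a visibly lower R-squared in the off-year cells.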
Explore Predictive Ability of Other Functional Measures
The search for an ideal index set to predict cost occurred in two stages. First, we examined individual components. The main question was whether the components entered the model in the expected direction. More specifically, we fit a linear regression model predicting cost from the components of the motor and cognitive scale. We checked to see which, if any, of the components had positive coefficients--implying that greater functional independence increased cost. Such irregularities would flag further investigation of the data
collection process for that component of the scale. We then might reconsider how or if it would be used in the index set. We also fit GAM to the components to look for nonlinear effects.
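The sign check on component coefficients can be sketched as follows. This is an illustration under stated assumptions: the regression is solved with a small hand-written normal-equations routine rather than the statistical package used in the study, the function names are hypothetical, and the two components and their data are invented so that one coefficient is positive by construction.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """Least squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def positive_components(names, rows, y):
    """Names of components whose coefficient is positive, i.e., where greater
    functional independence is associated with higher predicted cost."""
    coefs = ols(rows, y)[1:]  # drop the intercept
    return [name for name, c in zip(names, coefs) if c > 0]


# Hypothetical component scores (eating, memory) for seven cases (illustration only).
scores = [(1, 1), (2, 1), (3, 2), (4, 1), (5, 3), (6, 2), (7, 3)]
cost = [20000 - 1000 * e + 500 * m for e, m in scores]
flagged = positive_components(["eating", "memory"], scores, cost)
print(flagged)
```

A flagged component is a prompt to look at how that item was collected, not evidence that the item should be dropped.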
Second, we experimented with the subscales described in Stineman, Jette, et al. (1997b). These split out the standard motor index into dimensions reflective of different body areas and types of function.
Examine Stability Across Years
Our previous results were based on 1996 and 1997 data and did not give us