Throughout all of the gazetteering experiments, we observed a consistent trade-off between precision and recall: as one increased, the other decreased. In most cases, the gain was large enough to offset the loss, resulting in a higher F1 score. On the strict and lenient phrase F1 scores, the problem and treatment types exceeded the baseline while the test type did not. Some gazetteers outperformed others by several points. To understand why, we analyzed the number of terms that overlapped between the gazetteers and the datasets (Table 6). We found that gazetteers with a higher number of terms appearing in the datasets had a higher impact. This is a logical conclusion, as more relevant data will result in better outcomes. What was surprising was the amount of crossover some gazetteers had. ICD10PCS and CPT codes had a large number of unique terms but very little crossover. This indicates that the vocabulary used in billing codes does not match what is used by medical professionals. Public lists, such as the WebMD test list and the Southern Cross surgery list, had a high level of crossover. The FDA drug list also had a high level of crossover. This makes sense, as drugs are a common treatment for almost any condition.
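The overlap analysis described above can be illustrated with a short sketch. It assumes that overlap is counted as the number of distinct gazetteer terms that also occur among the dataset terms after simple normalization; the function names and the toy term lists are hypothetical and are not taken from Table 6.

```python
from __future__ import annotations


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(term.lower().split())


def overlap_count(gazetteer: set[str], dataset_terms: set[str]) -> int:
    """Count distinct gazetteer terms that also occur among the dataset terms."""
    gaz = {normalize(t) for t in gazetteer}
    data = {normalize(t) for t in dataset_terms}
    return len(gaz & data)


# Toy, made-up inputs purely for illustration (assumption, not real data).
billing_codes = {"Bypass Coronary Artery, One Artery", "Excision of Left Lung"}
public_list = {"Chest X-Ray", "Holter monitor", "MRI"}
annotated_terms = {"chest x-ray", "holter  monitor", "fine crackles"}

billing_overlap = overlap_count(billing_codes, annotated_terms)
public_overlap = overlap_count(public_list, annotated_terms)
print(billing_overlap, public_overlap)
```

In this toy case the formally worded billing-style list shares no terms with the annotations, while the public list shares two, which mirrors the pattern reported for the real gazetteers.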
The inverse relationship between precision and recall is not a new phenomenon in machine learning. Many algorithms struggle to find a balance between the two. In the context of our experiments, one potential reason for the trade-off is the use of determiners and pronouns. In the i2b2 dataset, many of the annotations include determiners and pronouns at the beginning or in the middle. Some examples are: “an outpatient holter monitor”, “his chest x-ray”, and “a few fine crackles at the left base”. In the annotations generated by gazetteering through MIMIC, these pronouns and determiners are not selected. As a result, training too heavily on the gazetteering data could lead the model to include the core terms while excluding the pronouns and determiners.
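The effect of dropped determiners and pronouns on the scores can be sketched as follows. This is a minimal illustration that assumes strict matching requires identical span boundaries and label, while lenient matching accepts any overlap between spans with the same label; the span representation and the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def spans_overlap(a: Span, b: Span) -> bool:
    """True if the two spans share a label and at least one token."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def prf(gold: List[Span], pred: List[Span], lenient: bool) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict or lenient span matching."""
    if lenient:
        tp_pred = sum(any(spans_overlap(p, g) for g in gold) for p in pred)
        tp_gold = sum(any(spans_overlap(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        tp_pred = sum(p in gold_set for p in pred)
        tp_gold = sum(g in pred_set for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Tokens: his(0) chest(1) x-ray(2) showed(3) a(4) few(5) fine(6)
#         crackles(7) at(8) the(9) left(10) base(11)
gold = [(0, 3, "test"), (4, 12, "problem")]   # i2b2-style spans with pronoun/determiner
pred = [(1, 3, "test"), (6, 8, "problem")]    # gazetteer-style spans without them

strict_scores = prf(gold, pred, lenient=False)
lenient_scores = prf(gold, pred, lenient=True)
print(strict_scores, lenient_scores)
```

Under these assumptions, a prediction that omits the leading pronoun or determiner still counts under lenient matching but is an error under strict matching, so the two scores can diverge sharply on the same output.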
CHAPTER 5
THRESHOLDING
In this chapter, we describe the experiments we conducted on thresholding and the reasoning behind them. The experiments are split into 1) single-class experiments and 2) multi-class experiments, where the multi-class model attempts to label all entity types within a single model. A full listing can be found in Table 7. The first set of experiments takes each individual type and trains on pseudo-annotations thresholded at a given confidence score. Each experiment uses a fixed number of epochs. The multi-class experiment follows the same procedure as the single-class experiments, except that all classes are applied at the same time.
5.1 Methods
5.1.1 Preparatory Steps
The MIMIC-III database was downloaded. Discharge summaries were extracted and run through pre-processing.
5.2 Experimental Details
Text for discharge summaries was extracted from the 'NoteEvents' table where 'Category' was set to 'Discharge summary'. Pre-processing started with an initial step of combining all de-identified terms into single terms that could be easily turned into features, including numbers, dates, and times. Punctuation was then modified to match the format of the i2b2 dataset.
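A minimal sketch of the first pre-processing step is given below. It assumes that de-identified spans appear in the `[** ... **]` bracket form used in MIMIC-III notes; the replacement token names (`DEID_DATE`, `DEID_TERM`, `TIME`, `NUM`), the regular expressions, and the function name are hypothetical choices for illustration, not the ones used in the experiments.

```python
import re

# Assumed patterns; the exact rules used in the experiments are not specified.
DEID = re.compile(r"\[\*\*(.*?)\*\*\]")            # de-identified placeholder
DEID_DATE = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$")  # e.g. 2101-10-20
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")             # e.g. 10:30
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")           # e.g. 88 or 7.5


def collapse_terms(text: str) -> str:
    """Replace de-identified spans, times, and numbers with single tokens."""

    def replace_deid(match: "re.Match[str]") -> str:
        inner = match.group(1).strip()
        return "DEID_DATE" if DEID_DATE.match(inner) else "DEID_TERM"

    text = DEID.sub(replace_deid, text)   # placeholders first, so their digits vanish
    text = TIME.sub("TIME", text)         # times before plain numbers
    return NUMBER.sub("NUM", text)


sample = "Seen by Dr. [**Last Name (STitle) 1234**] on [**2101-10-20**] at 10:30 with HR 88."
collapsed = collapse_terms(sample)
print(collapsed)
```

Collapsing each placeholder to one token keeps a multi-word de-identified span from being split into several uninformative features. The punctuation normalization step is not sketched here because its rules are not described.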
Table 7. Thresholding Experiments
Name | Model Type | Description
Problem Thresholding | Single-Class | Experiment determining the impact on precision and recall in comparison to baseline for the problem type when including thresholded pseudo-data in the training process.
Test Thresholding | Single-Class | Experiment determining the impact on precision and recall in comparison to baseline for the test type when including thresholded pseudo-data in the training process.
Treatment Thresholding | Single-Class | Experiment determining the impact on precision and recall in comparison to baseline for the treatment type when including thresholded pseudo-data in the training process.
Multi-Class Thresholding | Multi-Class | Experiment determining the impact on precision and recall in comparison to baseline for all types when including thresholded pseudo-data in the training process.
Fig. 19. Visual Pipeline Representing Thresholding.
loaded into memory. Sentences shorter than 8 tokens were removed to obtain data similar to i2b2. 400,000 sentences were then randomly selected and written to files. We began the thresholding process by loading the i2b2 vectors into memory and training a baseline network model. This model was then used to annotate the MIMIC-III dataset. All annotations above a set confidence level were kept for the next cycle and added to a pool. We then trained a new model on the i2b2 data and the pooled annotations from the previous iteration. We repeated this process until the percent difference between generated annotations was less than 5% or until 10 iterations had been run. Analysis was then performed on the final generated model. We completed this process for each annotation type in a single-class model and in a combined multi-class model with all three labels. All training occurred over 15 epochs. This number was chosen by training the network over a large number of epochs numerous times and selecting the point where additional training produced negligible improvement.
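The iterative procedure above can be sketched as a simple loop. This is an illustration under stated assumptions, not the implementation used in the experiments: `train` and `annotate` are hypothetical callables standing in for the network code, the pool is assumed to hold the annotations kept from the most recent pass, and the "percent difference" is assumed to be the relative change in the number of kept annotations between consecutive iterations.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]        # (sentence, annotation) placeholder type
Scored = Tuple[Example, float]   # pseudo-annotation with model confidence


def percent_difference(prev: int, curr: int) -> float:
    """Relative change in annotation count, in percent (assumed criterion)."""
    if prev == 0:
        return 0.0 if curr == 0 else float("inf")
    return abs(curr - prev) / prev * 100.0


def threshold_self_training(
    gold: Sequence[Example],
    unlabeled: Sequence[str],
    train: Callable[[List[Example]], object],
    annotate: Callable[[object, Sequence[str]], List[Scored]],
    threshold: float,
    max_iters: int = 10,
    min_change_pct: float = 5.0,
):
    """Return (final model, final pool, iterations run)."""
    model = train(list(gold))    # baseline trained on the gold data only
    pool: List[Example] = []
    prev_count = None
    iterations_run = 0
    for _ in range(max_iters):
        iterations_run += 1
        scored = annotate(model, unlabeled)
        # Keep only pseudo-annotations above the confidence threshold.
        pool = [ex for ex, conf in scored if conf > threshold]
        # Retrain on gold data plus the pooled pseudo-annotations.
        model = train(list(gold) + pool)
        if prev_count is not None and percent_difference(prev_count, len(pool)) < min_change_pct:
            break
        prev_count = len(pool)
    return model, pool, iterations_run


# Toy stand-ins (assumptions): the "model" is just the training set size,
# and confidences are fixed regardless of the model.
def toy_train(data: List[Example]) -> int:
    return len(data)


def toy_annotate(model: object, sentences: Sequence[str]) -> List[Scored]:
    confidences = [0.9, 0.4, 0.8]
    return [((s, "problem"), c) for s, c in zip(sentences, confidences)]


final_model, final_pool, n_iters = threshold_self_training(
    gold=[("chest pain", "problem")],
    unlabeled=["fever", "unclear text", "shortness of breath"],
    train=toy_train,
    annotate=toy_annotate,
    threshold=0.7,
)
print(final_model, len(final_pool), n_iters)
```

With the toy stand-ins the kept set is identical on the second pass, so the loop stops after two iterations; with a real model the pool would typically change between passes until the stopping criterion or the iteration cap is reached.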
5.3 Results and Discussion