CHAPTER 4. RESULTS AND DISCUSSION
4.3. DISCUSSION OF THE RESULTS
Automated classification tools for (I123)FP-CIT imaging, based on machine learning methods, have been investigated by numerous authors. Table 1-3 summarises the techniques applied in the literature since 2010 to distinguish between patients with and without pre-synaptic dopaminergic deficit. It also lists the image features extracted, the reported performance metrics, the data from which the performance figures were derived and the chosen cross-validation technique. The table includes only algorithms whose training data is based on SPECT images alone; algorithms with multimodality inputs are excluded.
Summary of automated classification research for (I123)FP-CIT imaging since 2010 (articles ordered according to accuracy)

Authors: Augimeri, Cherubini, [...]
Validation data + method: 43 local images (12 normal, 31 Parkinson's Disease (PD)), no [...]

Authors: [...]
Image features: Analysis of 42nd slice only. Striatal binding ratios in both caudates and putamena, radial features and gradient features. Features are tested for statistical significance (Wilcoxon rank) before use in the classifier

Authors: [...]
Image features: All voxels within the image
Classifier: CNN – PD net
Validation data + method: 701 images from the PPMI database (431 PD, 193 HC, 77 scans without evidence of dopaminergic deficit (SWEDD)). 82 local images (72 PD, 10 non-parkinsonian)

Authors: [...]
Image features: Binding ratios in the putamen, caudate and striatum, striatal volume and length in both brain hemispheres
Classifier: SVM, k-nearest neighbour (k-NN), logistic regression
Validation data + method: 652 images from the PPMI database (209 HC, 443 PD). Leave-one-out CV
Results: Maximum of: Accuracy = 97.9%, Sensitivity = 98.0%, Specificity = 97.6%

Authors: Oliveira, Castelo-Branco, 2015 (56)
Image features: Image voxels within striatal region of interest
Classifier: SVM
Validation data + method: 654 images from PPMI database (209 HC, 445 PD). Leave-one-out CV

Authors: [...]
Image features: 16 shape and 14 surface fitting features of selected slices, following thresholding. Striatal binding ratios of both caudates and putamena and asymmetry indices were also considered. Features are tested for statistical significance (Wilcoxon rank) before use in the classifier
Validation data + method: [...] SWEDD). 10 fold CV, repeated 100 times. Hyperparameters for SVM chosen through 10 fold CV
Results: SVM: [...]

Authors: Tagare, DeLorenzo, Chelikani et al., 2017 (58)
Image features: Voxel intensities within a region of interest
Classifier: Logistic lasso
Validation data + method: 658 images from PPMI database (210 HC, 448 PD). 3 fold CV for performance assessment. Parameters chosen through 10 fold CV (nested within outer 3 fold CV).
Results: Maximum of: Accuracy = 96.5 ± 1.3%

Authors: Palumbo, Fravolini, Buresta et al., 2014 (59)
Image features: Striatal binding ratios for both caudates and putamena (and a subset of these 4 features), patient age
Classifier: SVM with RBF kernel
Validation data + method: 90 local images from patients with 'mild' symptoms (34 non-PD, 56 PD). Leave-one-out and 5 fold CV
Results: Maximum of: Accuracy = 96.4%

Authors: Prashanth, Dutta Roy, Mandal et al., 2014 (60)
Image features: Striatal binding ratio for both caudates and putamena
Classifier: SVM, linear and with RBF kernel.
Validation data + method: 493 images from PPMI database (181 HC, 369 early PD), 10 fold CV, no repeats

Authors: [...]
Image features: 12 Haralick texture features within a brain region of interest
Classifier: SVM
Validation data + method: 'Whole' PPMI database. Leave-one-out CV
Results: Maximum of: Accuracy = 95.9%, Sensitivity = 97.3%, Specificity = 94.9%

Authors: Zhang, Kagen, 2016 (62)
Image features: Voxel intensities from a single axial slice, repeated for 3 different slices
Classifier: Single layer Neural network
Validation data + method: 1513 images from PPMI database (baseline and follow-up, 1171 PD, 211 HC, 131 SWEDD). 1189 images for training, 108 for validation, 216 for testing. 10 fold CV
Results: Maximum of: Accuracy = 95.6 ± 1.5%

Authors: [...]
Image features: [...] (PCA) decomposition of voxel data (after applying empirical mode decomposition) within regions of interest
Classifier: SVM
Validation data + method: 80 local images (39 non-pre-synaptic dopaminergic deficit [...]

Authors: [...]
Image features: Downsampled voxel intensities
Classifier: CNNs – modified versions of [...]

Authors: [...]
Image features: [...] within striatal region of interest
Classifier: Naïve-Bayes, Group prototype
Validation data + method: 116 local images (37 non-PDD, 79 PDD). Leave-one-out CV

Authors: [...]
Validation data + method: 189 local images (94 non-PDD, 95 PDD). Leave-one-out CV
Results: Features varied from 1 to 20. Maximum of: Accuracy [...]

Authors: [...]
Validation data + method: 208 local images (100 non-PDD, 108 PDD), 289 images from PPMI database (114 normal, [...]
Results: [...] Sensitivity = 98.2%, Specificity = 93.0%

Authors: Kim, Wit, Thurston, 2018 (68)
Image features: Image voxel intensities in a single axial slice
Classifier: CNN – Inception v3 network
Validation data + method: 108 local images for training, 45 for hold out testing

Authors: [...]
Image features: Image voxel intensities & image voxels within striatal region of interest
Classifier: Nearest mean, linear SVM
Validation data + method: 208 local images (108 non-PDD, 108 PDD). 30 random permutations CV, with 1/3 data held out for testing

Authors: [...]
Image features: Striatal binding ratios for caudates and putamena on 3 slices
Classifier: Probabilistic Neural network (PNN), Classification tree (CT)
Validation data + method: 216 local images (89 non-PDD, 127 PD). Two fold CV, repeated 1000 times
Results: For patients with essential tremor mean probability of correct classification = 93.5 ± 3.4%

Table 1-3 Summary of machine learning algorithms applied to (I123)FP-CIT image classification in the literature since 2010, including reported performance figures. Articles are listed in order of accuracy. Where accuracy values are not available these are grouped towards the bottom of
the table. Table is adapted from (1)
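
Several of the entries in Table 1-3 use striatal binding ratios (SBRs) and asymmetry indices as classifier inputs. Neither quantity is defined in this section, so the following is only a minimal sketch. It assumes the commonly used definition of the SBR as specific uptake relative to a non-specific reference region, and one common convention for the asymmetry index. The function names and count values are hypothetical, and the reviewed studies may differ in how regions and reference areas are defined.

```python
def striatal_binding_ratio(target_mean, reference_mean):
    """SBR = (mean counts in striatal region - mean counts in reference region)
    divided by the reference mean. Assumed definition, see lead-in."""
    if reference_mean <= 0:
        raise ValueError("reference_mean must be positive")
    return (target_mean - reference_mean) / reference_mean


def asymmetry_index(left, right):
    """Absolute left/right difference as a percentage of their mean
    (one common convention, assumed here)."""
    return 100.0 * abs(left - right) / ((left + right) / 2.0)


# Hypothetical mean counts per voxel (illustrative values only, not patient data)
reference = 20.0
left_putamen_sbr = striatal_binding_ratio(60.0, reference)
right_putamen_sbr = striatal_binding_ratio(50.0, reference)

# A simple feature vector of the kind passed to a classifier
features = [left_putamen_sbr, right_putamen_sbr,
            asymmetry_index(left_putamen_sbr, right_putamen_sbr)]
print(features)
```

A feature vector of this kind has only a handful of elements per scan, which is consistent with the relatively simple inputs used by many of the listed studies.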
A number of trends are immediately apparent from Table 1-3. Firstly, the reported performance figures are universally high. Most accuracy values are greater than 90%, with some authors reporting almost perfect performance. This contrasts with the accuracy figures previously summarised for visual image analysis (see section 1.1.5) and for semi-quantification (see section 1.2.1), which were typically in the 80-90% range. These results clearly show that established machine learning algorithms are a promising technology for creating CADx software. As in previous discussions, however, these figures should be treated with a degree of caution. Not only is performance likely to be strongly related to the particular case mix in the database, but the method of cross-validation can also have a significant impact on results (71–73).
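
To make the difference between cross-validation schemes concrete, the following minimal sketch shows how leave-one-out and k-fold cross-validation partition a dataset. It is illustrative only: the function names and the dataset size are hypothetical, and it is not the code used by any of the cited studies.

```python
import random


def k_fold_splits(n_samples, k, seed=None):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.
    Every sample appears in exactly one test fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    if seed is not None:
        # Shuffle reproducibly so that repeats with different seeds differ
        random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_splits(n_samples):
    """Leave-one-out is the special case k = n_samples."""
    return k_fold_splits(n_samples, n_samples)


n = 90  # hypothetical dataset size
loo = list(leave_one_out_splits(n))
ten_fold = list(k_fold_splits(n, 10, seed=0))
print(len(loo), len(loo[0][1]))            # number of folds, test cases per fold
print(len(ten_fold), len(ten_fold[0][1]))
```

Repeated k-fold cross-validation calls the splitter again with a different seed, and nested cross-validation applies the same splitting a second time within each training set to choose hyperparameters. Because each scheme trains on a different fraction of the data and averages over a different number of test cases, performance figures obtained with different schemes are not directly comparable.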
The Parkinson's Progression Markers Initiative (PPMI) database of SPECT data (www.ppmi-info.org/data) is cited by most authors as a source of validation data. This is perhaps unsurprising as the data is freely available to researchers, without the need to apply for ethical approval or to go through other lengthy governance processes. As patients were recruited prospectively, following a battery of other tests and screening stages, the diagnostic coding is likely to be relatively reliable. The other advantage of using the PPMI data is that it allows greater comparability between research studies. However, this research database is unlikely to reflect the patient cohorts seen in clinical nuclear medicine. The patient groups included are healthy controls, Parkinson's Disease and scans without evidence of dopaminergic deficit (SWEDD). In clinic, a range of atypical Parkinsonisms are seen, as well as DLB and other diseases which do not affect nigrostriatal pathways. Furthermore, patients were only included in the PD group if their SPECT scan showed DaT deficit (74), which may have excluded any patients for whom signs of disease were subtle. The strict controls on imaging protocols, camera calibration steps and image reconstruction (75,76) also do not reflect clinical reality.
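
The dependence of accuracy on case mix can be illustrated with a short calculation. The sketch below assumes a hypothetical classifier whose sensitivity and specificity are fixed (the values of 95% and 80% are illustrative and are not taken from any study discussed here) and shows how the overall accuracy changes with the proportion of abnormal cases in the test cohort.

```python
def metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


def expected_accuracy(sensitivity, specificity, prevalence):
    """Accuracy of a classifier with fixed sensitivity and specificity when a
    fraction `prevalence` of the cohort is abnormal."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Hypothetical classifier evaluated on cohorts with different case mixes
for prevalence in (0.5, 0.7, 0.9):
    print(prevalence, round(expected_accuracy(0.95, 0.80, prevalence), 3))
```

When sensitivity exceeds specificity, as in this example, a cohort dominated by abnormal cases yields a higher accuracy than a balanced cohort, even though the classifier itself is unchanged. Accuracy figures from databases with different proportions of patient groups should therefore be compared with care.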
The range of classifiers used by researchers is wide, although support vector machines (SVM), either in conventional linear form or with a radial basis function (RBF) kernel, appear to dominate. This is likely to be because SVM was considered a 'state-of-the-art' algorithm until relatively recently and had been successful in numerous classification problems. The image features extracted and used as input to the classifiers are varied. However, in most cases relatively simple features are chosen (such as raw voxel intensities and SBRs). This suggests that complex pre-processing is not required to achieve good classification performance. In general, the most recent articles gave the highest accuracy figures, with some exceptions. This may be because authors have built upon the findings of previous research work and sought to address limitations that were previously identified. The two-class classification paradigm dominates recent research, where the classifier is trained to separate two different groups of data. Alternatively, the problem could be considered as a one-class system, where the classifier is trained to find the boundaries of one class within feature space, without explicit reference data from other diagnostic classes.
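
The distinction between the two-class and one-class paradigms can be sketched with a deliberately simple model. The example below uses a nearest-mean rule for the two-class case and a distance threshold around the normal class for the one-class case. Both are stand-ins chosen for brevity rather than the methods of any particular study, and the feature values are hypothetical.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def two_class_predict(x, mean_normal, mean_abnormal):
    """Two-class nearest-mean rule: assign x to whichever class mean is closer."""
    if math.dist(x, mean_normal) <= math.dist(x, mean_abnormal):
        return "normal"
    return "abnormal"


def one_class_fit(rows):
    """One-class model: centre of the reference class plus the largest training
    distance from that centre, used as the class boundary."""
    centre = mean_vector(rows)
    radius = max(math.dist(r, centre) for r in rows)
    return centre, radius


def one_class_predict(x, centre, radius):
    """True if x falls inside the learned boundary of the reference class."""
    return math.dist(x, centre) <= radius


# Hypothetical two-element feature vectors (e.g. left and right putamen SBR)
normal = [[2.8, 2.7], [3.0, 2.9], [2.6, 2.8]]
abnormal = [[1.2, 1.0], [1.5, 0.9], [1.0, 1.3]]

# Two-class: both groups are needed for training
label = two_class_predict([1.1, 1.1], mean_vector(normal), mean_vector(abnormal))

# One-class: only the normal group is needed for training
centre, radius = one_class_fit(normal)
inside = one_class_predict([1.1, 1.1], centre, radius)
print(label, inside)
```

In the one-class case a scan is flagged simply because it lies outside the boundary of the normal group, so no examples of the abnormal classes are required at training time.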
Overall, analysis of previous literature on automated binary classification of (I123)FP-CIT images confirms that existing machine learning algorithms are associated with high accuracy, which is generally in excess of the accuracy figures reported for human observers alone and for human observers assisted by semi-quantification software. However, given the differences in patient datasets, acquisition protocols and analysis methods, direct comparison between these different approaches to diagnosis is associated with significant uncertainty.
To date there has not been a direct, comprehensive comparison between semi-quantification methods and machine learning in terms of accuracy or any other performance metrics.
Towey (65) did provide a comparison between two automated classifiers and a limited number of commercial semi-quantification tools. However, the dataset used was relatively small and there was a fundamental bias in the findings in that results for the semi-quantification approaches were reported from the training data rather than from an independent test set. Furthermore, no machine learning algorithm has yet been tested in the clinic under realistic reporting conditions (e.g. in support of a human reporter).
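
The bias introduced by reporting performance on training data can be illustrated with a simple threshold classifier of the kind used in semi-quantification. The sketch below uses synthetic values only. The distributions, sample sizes and function names are assumptions made for illustration and do not reproduce the analysis in (65).

```python
import random


def accuracy(samples, threshold):
    """Fraction of (value, is_abnormal) samples classified correctly when
    values below the threshold are called abnormal."""
    correct = sum((value < threshold) == is_abnormal for value, is_abnormal in samples)
    return correct / len(samples)


def best_threshold(samples):
    """Pick the candidate threshold with the highest accuracy on the given samples."""
    candidates = sorted(value for value, _ in samples)
    return max(candidates, key=lambda t: accuracy(samples, t))


rng = random.Random(0)
# Synthetic, overlapping uptake values: abnormal scans are lower on average
data = [(rng.gauss(1.5, 0.6), True) for _ in range(40)] + \
       [(rng.gauss(2.5, 0.6), False) for _ in range(40)]
rng.shuffle(data)
train, test = data[:40], data[40:]

# The threshold is tuned on the training half only
threshold = best_threshold(train)
print("accuracy on training data:", accuracy(train, threshold))
print("accuracy on independent test data:", accuracy(test, threshold))
```

Because the threshold is selected to maximise accuracy on the training half, the training figure is by construction at least as high as that of any other candidate threshold on the same data, and so tends to overestimate performance on new cases. The figure from the held-out half gives the fairer estimate.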
If CADx systems based on machine learning algorithms are to be used to benefit patient care, these gaps in knowledge need to be filled. Filling them is the main focus of this work.