Final Considerations - Universidad Católica San Pablo: Unsupervised anomaly detection in 2D rad

The X-ray images of the pelvis structure have different variations due to equipment, sharpness, illumination, and patient movements during the X-ray image acquisition. In this regard, considering the limited number of radiographs, the generation of synthetic data (DRRs) is proposed to increase the accuracy in the detection of anomalies. To this end, the DRRs were projected from CTs considering different variations that try to cover an even larger population.

The proposed method is a generative adversarial network, which has an autoencoder as the generator and an additional encoder to force the latent space consistency. This method aims to guarantee the reconstruction of realistic samples, similar to the input data, and most importantly reconstruct X-ray images without anomalies.

Once the reconstruction of an input X-ray image is obtained, an anomaly score is computed between the input and the output images. This anomaly score combines the reconstruction loss, the latent space consistency loss, and the Fréchet distance loss, which measures the visual quality features of the reconstructed image against those of the input image. X-rays with an anomaly score larger than 0.5 are considered abnormal, whereas X-rays with an anomaly score less than 0.5 are considered normal. Finally, the anomalies are localized by taking the difference between the input and the output and then applying thresholding and morphological operations to obtain the mask of the abnormal region.
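The scoring and localization steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the equal weights of the three loss terms, the min-max rescaling of the scores to [0, 1], the use of the absolute difference, the 3×3 structuring element, and all function names are hypothetical and are not taken from the thesis.

```python
from typing import Callable, List

Mask = List[List[int]]
Image = List[List[float]]  # grayscale image as rows of pixel intensities


def anomaly_score(rec: float, latent: float, frechet: float,
                  w_rec: float = 1.0, w_lat: float = 1.0, w_fd: float = 1.0) -> float:
    """Weighted sum of the three loss terms (the weights are assumptions)."""
    return w_rec * rec + w_lat * latent + w_fd * frechet


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale raw scores to [0, 1] so a fixed threshold applies (assumed step)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def is_abnormal(score: float, threshold: float = 0.5) -> bool:
    """Scores above 0.5 are abnormal; a score of exactly 0.5 is treated as normal here."""
    return score > threshold


def difference_mask(x: Image, x_hat: Image, pixel_threshold: float) -> Mask:
    """Binary mask of pixels whose absolute input/output difference exceeds the threshold."""
    return [[1 if abs(a - b) > pixel_threshold else 0 for a, b in zip(row, row_hat)]
            for row, row_hat in zip(x, x_hat)]


def _morph(mask: Mask, op: Callable[[List[int]], int]) -> Mask:
    """Apply op (min = erosion, max = dilation) over each 3x3 neighborhood.
    Pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [mask[y][x] if 0 <= y < h and 0 <= x < w else 0
                      for y in range(i - 1, i + 2) for x in range(j - 1, j + 2)]
            out[i][j] = op(window)
    return out


def opening(mask: Mask) -> Mask:
    """Morphological opening (erosion, then dilation) to remove isolated pixels."""
    return _morph(_morph(mask, min), max)
```

Opening is only one possible choice of morphological operation; it is used here because it discards isolated difference pixels while keeping compact regions.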

Master Program in Computer Science - UCSP 61


Chapter 5 Experiments and Results

To test our fundamental hypothesis, namely that anomalies can be detected in an unsupervised fashion using a generative adversarial approach, we used anteroposterior pelvic X-ray images from several emergency hospitals as a case study. This allows us to validate the performance of the proposed method in detecting different kinds of anomalies with variable shape and texture in pelvic X-rays.

This chapter describes the experiments and the results obtained for the proposed case study. Section 5.1 introduces the experiments that measure the performance of the model using a clinical dataset and a synthetic dataset. Section 5.2 presents the quantitative results based on the AUC metric and the qualitative results based on visual inspection. Finally, Section 5.3 presents a discussion of the obtained results.

5.1 Experiments

To validate the model performance in the detection of anomalies, and specifically to determine whether the use of synthetic data improves the accuracy of the model, we performed two experiments. The first experiment used only clinical data for both the training and the testing stages. The second experiment used only synthetic data for the training stage, whereas the testing stage was carried out on clinical data. We refer to the first experiment as the clinical experiment and to the second as the synthetic experiment.

5.1.1 Data

We applied the proposed method to clinical radiographs and DRRs of the anterior-posterior (AP) view of the pelvis of males and females. The clinical radiographs were provided by CiTeSoft-UNSA ¹. Additionally, the model was trained with data (DRRs) derived from CTs at 1000shapes. The clinical radiographs were collected from three hospitals: Hogar Clínica San Juan de Dios, Hospital Honorio Delgado, and Hospital Nacional Carlos Alberto Seguin Escobedo. According to the dataset report, the X-ray images were taken using different kinds of radiological equipment, so they differed in resolution, contrast and illumination. The pelvic CTs used to generate the synthetic dataset showed the L5 vertebra, the pelvis, the proximal femur, and the shaft of the femur (roughly 15 cm of the shaft).

Clinical Experiment.- The clinical data contained 184 AP pelvic X-ray images from different patients, of which 153 were normal and 31 had anomalies. To obtain a balanced dataset for the testing stage, 31 normal X-rays and the 31 X-rays with anomalies were set aside and used exclusively as the testing dataset. The clinical experiment then used the remaining 122 normal clinical images, which were divided into 110 (90%) training and 12 (10%) validation images.

Synthetic Experiment.- The synthetic data was generated from 149 CTs without anomalies of males and females. From each CT, 170 DRRs were generated, yielding a total of 25,330 simulated 2D radiographs in variations of the AP view. This experiment also included the 122 normal clinical radiographs; together with all DRRs, they were divided into 22,907 (90%) training and 2,545 (10%) validation images. The testing dataset was the same as in the clinical experiment.

Anomalies.- In our case study, anomalies refer to prostheses, screws, nails, zippers and other metal objects present in the pelvic X-ray. The dataset contained mainly prostheses, screws and nails. In some cases, a single X-ray image contained more than one anomaly (see Figure 5.1).
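The dataset sizes reported for the two experiments can be checked with a few lines of arithmetic. The sketch below assumes that the 90% training share is rounded to the nearest integer, which is not stated in the text but reproduces the reported counts; the function and variable names are illustrative only.

```python
from typing import Tuple


def split_90_10(total: int) -> Tuple[int, int]:
    """Return (training, validation) sizes, rounding the 90% share (assumed rule)."""
    train = round(0.9 * total)
    return train, total - train


# Clinical experiment: a balanced test set is held out first.
n_normal, n_abnormal = 153, 31
n_test = 2 * n_abnormal                      # 31 normal + 31 abnormal images
remaining_normal = n_normal - n_abnormal     # normal images left for training/validation
clinical_train, clinical_val = split_90_10(remaining_normal)

# Synthetic experiment: DRRs plus the remaining normal clinical radiographs.
n_cts, drrs_per_ct = 149, 170
n_drrs = n_cts * drrs_per_ct
synthetic_pool = n_drrs + remaining_normal
synthetic_train, synthetic_val = split_90_10(synthetic_pool)

print(clinical_train, clinical_val)          # 110 12
print(n_drrs, synthetic_train, synthetic_val)  # 25330 22907 2545
```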

Figure 5.1: Kinds of anomalies in the pelvis.

¹ Research Center of Universidad Nacional San Agustin of Arequipa - Peru


5.1.2 Implementation Details

Network Details.- The radiographs were resized to 256×256, and the network weights were randomly initialized from a normal distribution with µ = 0 and σ = 0.02. We used the Adam optimizer with β1 = 0.5 and β2 = 0.999. The learning rate was 2e-4 and the batch size was 64. The kernel size was 4×4 and the stride was 2 for the three networks: generator G, discriminator D and encoder E, as illustrated in Figure 4.3. The weights of the discriminator were reloaded whenever its loss fell below 1e-5.
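For reference, the hyperparameters listed above can be collected in a single configuration object. This is a hypothetical sketch: the class name, the field names and the helper function are illustrative and do not come from the thesis implementation; only the values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; all names are assumptions."""
    image_size: int = 256                 # radiographs resized to 256x256
    init_mean: float = 0.0                # mean of the normal weight initialization
    init_std: float = 0.02                # standard deviation of the initialization
    adam_beta1: float = 0.5
    adam_beta2: float = 0.999
    learning_rate: float = 2e-4
    batch_size: int = 64
    kernel_size: int = 4
    stride: int = 2
    disc_reload_threshold: float = 1e-5   # reload D when its loss falls below this


def should_reload_discriminator(d_loss: float, cfg: TrainConfig) -> bool:
    """True when the discriminator loss is small enough to trigger a weight reload."""
    return d_loss < cfg.disc_reload_threshold


cfg = TrainConfig()
```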

Generator G consisted of 7 layers for both the encoder G_E and the decoder G_D networks.

G_E used convolutional layers followed by Batch-Norm and LeakyReLU activation. The first layer did not include Batch-Norm, and the last layer did not have an activation. G_D used transposed convolutional layers followed by Batch-Norm and ReLU activation, as in the DCGAN generator. Its last layer used Tanh activation. The encoder E adopted the same architecture as G_E. Finally, D was the same as G_E, except that it used a Sigmoid activation in the last layer. To analyze the performance of the proposed method, we trained the model using three sizes of the latent space vector: z_i = 100, z_i = 256 and z_i = 512.
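To make the effect of the 4×4 kernel and stride 2 concrete, the following sketch applies the standard output-size formulas for convolution and transposed convolution. The padding of 1 is an assumption, since the text reports only the kernel size and the stride; the text also does not say how the final feature map is mapped to the latent vector, so the sketch stops at the spatial sizes.

```python
from typing import List


def conv_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def trace(size: int, n_layers: int, step) -> List[int]:
    """Feature-map size after each of n_layers applications of step."""
    sizes = []
    for _ in range(n_layers):
        size = step(size)
        sizes.append(size)
    return sizes


encoder_sizes = trace(256, 7, conv_out)
decoder_sizes = trace(encoder_sizes[-1], 7, conv_transpose_out)
print(encoder_sizes)  # [128, 64, 32, 16, 8, 4, 2]
print(decoder_sizes)  # [4, 8, 16, 32, 64, 128, 256]
```

Under this assumed padding, each encoder layer halves the feature map and each decoder layer doubles it, so seven layers on each side map a 256×256 input back to a 256×256 output.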

5.1.3 Comparison to Related Methods

We performed a study with clinical and synthetic data, and compared the proposed architecture to related work such as AAE (Chen and Konukoglu, 2018), GANomaly (Akcay et al., 2018) and f-AnoGAN (Schlegl et al., 2019). These methods were trained in the same experimental setup as our method. The decision process for detecting images with disturbances was kept as described in the respective publications.

5.1.4 Runtime and Framework

For the experiments, we used a workstation equipped with a 2.30 GHz Intel Core i5 processor, as well as a server with NVIDIA Quadro RTX 6000 hardware, of which we used a single 24 GB GPU. All related methods and our proposed method were implemented in PyTorch. The training stage converged in 300 epochs for AAE (Chen and Konukoglu, 2018), in 400 epochs for GANomaly (Akcay et al., 2018), and in 500 and 240 epochs for the WGAN and the encoder of f-AnoGAN (Schlegl et al., 2019), respectively. Finally, our method converged in 450 epochs. The convergence of the clinical experiment and the synthetic experiment was similar. However, due to the large difference in the amount of training data, the clinical experiments took less time than the synthetic experiments.
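The difference in training time can be illustrated by counting mini-batch iterations for the proposed method. This is a rough sketch that assumes one pass over the training set per epoch and that the final partial batch is kept; it uses the batch size of 64, the 450 epochs, and the training set sizes (110 and 22,907 images) reported in this chapter. The function name is illustrative.

```python
import math
from typing import Tuple


def iterations(n_train: int, batch_size: int = 64, epochs: int = 450) -> Tuple[int, int]:
    """Return (iterations per epoch, total iterations), keeping the last partial batch."""
    per_epoch = math.ceil(n_train / batch_size)
    return per_epoch, per_epoch * epochs


clinical = iterations(110)       # clinical experiment
synthetic = iterations(22907)    # synthetic experiment
print(clinical)   # (2, 900)
print(synthetic)  # (358, 161100)
```

Under these assumptions the synthetic experiment requires 179 times as many iterations as the clinical one for the same number of epochs.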


In document Universidad Católica San Pablo: Unsupervised anomaly detection in 2D radiographs using generative models (pages 84-89)