Classification of echocardiography images using Convolutional Neural Network to assist Kawasaki disease diagnosis


(1) UNIVERSIDAD POLITÉCNICA DE MADRID. ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN. MASTER OF SCIENCE IN SIGNAL THEORY AND COMMUNICATIONS. MASTER THESIS. CLASSIFICATION OF ECHOCARDIOGRAPHY IMAGES USING CONVOLUTIONAL NEURAL NETWORK TO ASSIST KAWASAKI DISEASE DIAGNOSIS. CAPUCINE BERTRAND. 2019.

(2) UNIVERSIDAD POLITÉCNICA DE MADRID. ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN. Master of Science in Signal Theory and Communications. Master Thesis: CLASSIFICATION OF ECHOCARDIOGRAPHY IMAGES USING CONVOLUTIONAL NEURAL NETWORK TO ASSIST KAWASAKI DISEASE DIAGNOSIS. Author: Capucine Bertrand. Tutor: Julian Cabrera Quesada. Department: Departamento de Señales, Sistemas y Radiocomunicaciones. Madrid, February to June 2019.



(5) Acknowledgements. I would like to take advantage of this page to thank my tutor Julian Cabrera Quesada for his valuable advice and help during this master thesis and for guiding me when I was lost. Many thanks also to Tomás Mantecón for his patience, his availability, his support and for always coming up with a solution. Special thanks to the team of 12 de Octubre hospital for helping me make sense of echocardiography images and for taking the time to review my work. Thanks to Belen Toral for answering my emails and resolving my doubts, and to the whole team for their assistance during the classification process. I know how time-consuming it is and I am very grateful for all the time you spent working on this project. Finally, I would like to thank the whole team of ETSIT's GTI research group for welcoming me so warmly in the lab, for their cheerfulness and for their patience with my Spanish. You all made working on this master thesis as pleasant as it could be and contributed to making my stay in Madrid a great experience.

(6) Abstract. Kawasaki disease is the most common heart condition affecting young children, usually under five years old, in developed countries and especially in Asia [1]. It damages blood vessels all over the body and results in vasculitis, myocarditis and coronary dilation, causing long-term heart complications and making it essential to detect the disease at an early stage. One of the methods used to detect Kawasaki disease is to perform a 2D echocardiography to monitor the inflammation of the heart muscle and the swelling of the coronary arteries. Improving this technique is a cornerstone of good treatment for these children. Based on the success of Convolutional Neural Networks in solving computer vision problems such as image classification, this master thesis aims to develop a system that eases the diagnosis of Kawasaki disease using echocardiographies, focusing more specifically on coronary arteries. To do so, we use deep learning classification techniques such as convolutional neural networks to extract, from a 2D echocardiography video, the frames that contain a view of a coronary artery. These images can later be used to monitor the state of the coronary arteries.
Key words: image classification, convolutional neural network, machine learning, Kawasaki disease, image processing.

(7) Contents
Abstract
Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
1.1. Kawasaki disease
1.2. Objectives
2 State of the art
2.1. Image Classification with deep learning
2.2. Convolutional Neural Networks
2.2.1. AlexNet
2.2.2. VGG Networks [2]
2.2.3. ResNet [3]
2.2.4. EfficientNet [4]
2.3. Medical Images
3 Building the data set
3.1. Access to echo-cardiograms
3.2. Format of the data
3.3. Design of the data set

(8) 4 Development of the system
4.1. Description of the system
4.2. Detection of images with coronary arteries
4.2.1. Basic Convolutional Neural Networks
4.2.2. VGG
4.3. Classification of coronary artery views
4.3.1. Basic Convolutional Neural Networks
4.3.2. VGG
4.3.3. Resnet50
4.3.4. Resnet50 with transfer learning
5 Conclusion and future lines
5.1. Conclusion on the results
5.2. Next steps
Bibliography

(9) List of Figures
2.1. CNN architecture
2.2. AlexNet architecture
2.3. VGG architecture [2]
2.4. VGG error rate performance on 2014 ImageNet image classification challenge [2]
2.5. residual block with identity mapping layer
2.6. comparison of ResNet 151 with other 2015's state of the art networks for image classification [3]
2.7. ResNet architecture [3]
2.8. Model Scaling techniques [4]
2.9. EfficientNet Architecture [5]
2.10. Architecture of Stanford model for skin cancer detection [6]
3.1. acquisition of echocardiography
3.2. Heart plan captures with an echo-cardiogram
3.3. Examples used to find coronary arteries
3.4. Example of echocardiography frames with artery view
3.5. Example of echocardiography frames without any coronary artery view
3.6. folder hierarchies for labeled images
3.7. Distribution of classes in each DICOM file
3.8. Distribution of classes
4.1. organization of the final system classifying into two classes
4.2. first basic CNN model architecture
4.3. Example of learning behavior of the CNN with 3 layers
4.4. comparison between a full frame and a cropped frame
4.5. CNN model with 5 convolutional layers
4.6. 5 layers CNN model after 50 epochs, RMSprop optimizer and 1e-4 learning rate
4.7. confusion matrix of 5 layers CNN after 50 epochs using RMSprop optimizer and 1e-4 learning rate
4.8. result of 5 layers CNN model after 50 epochs using a SGD optimizer and 1e-4 learning rate
4.9. confusion matrix of 5 layers CNN model after 50 epochs using a SGD optimizer and 1e-4 learning rate
4.10. result of 5 layers CNN model after 50 epochs using an Adam optimizer and 1e-4 learning rate
4.11. confusion matrix of 5 layers CNN model after 50 epochs using an Adam optimizer and 1e-4 learning rate
4.12. result of 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate
4.13. confusion matrix of 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate
4.14. architecture of our VGG16 model
4.15. evolution of accuracy on VGG16 model during 50 epochs using an Adam optimizer and a 1e-4 learning rate
4.16. evolution of loss on VGG16 model during 50 epochs using an Adam optimizer and a 1e-4 learning rate
4.17. confusion matrix of VGG16 model after 50 epochs using an Adam optimizer and a 1e-4 learning rate
4.18. result of 5 layers CNN model after 50 epochs using a RMSprop optimizer and 1e-4 learning rate to classify 4 classes
4.19. confusion matrix of 5 layers CNN model after 50 epochs using a RMSprop optimizer and 1e-4 learning rate to classify 4 classes
4.20. result of 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes
4.21. confusion matrix of 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes
4.22. evolution of accuracy on VGG16 model during 50 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes
4.23. evolution of loss on VGG16 model during 50 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes
4.24. confusion matrix of VGG16 model after 68 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes
4.25. evolution of accuracy on ResNet50 model during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes
4.26. evolution of loss on ResNet50 model during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes
4.27. confusion matrix of ResNet50 model after 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes
4.28. evolution of loss on ResNet50 using transfer learning from a model trained on ImageNet dataset. The model is trained during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes
4.29. confusion matrix of ResNet50 using transfer learning from a model trained on ImageNet dataset. The model is trained during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes

(12) List of Tables
3.1. Resume of the data set
4.1. Summary of optimizer performance
4.2. Comparison of different models on 4 classes classification

(13) List of Acronyms
Adam: Adaptive Moment Estimation.
CNN: Convolutional Neural Network.
DICOM: Digital Imaging and Communications in Medicine.
GAN: Generative Adversarial Network.
GPU: Graphics Processing Unit.
ILSVRC: ImageNet Large Scale Visual Recognition Challenge.
ResNet: Residual Network.
RMSprop: Root Mean Square Propagation.
SGD: Stochastic Gradient Descent.
VGG: Visual Geometry Group.

(14) Chapter 1. Introduction. 1.1. Kawasaki disease. Kawasaki disease is named after the pediatrician who first reported it, almost 60 years ago. It mostly affects children under 5 years old and causes inflammation of blood vessels throughout the body. It is the most common heart disease affecting children in developed countries and, although it has been observed all over the world, it is particularly prevalent among Asian populations. The first symptoms of the disease are fever, rash, irritation and inflammation of the mouth, lips, throat and eyes, and swelling of the hands and feet. However, the main complication is heart distress, since the symptoms of Kawasaki disease include a rapid heart rate, a collection of fluid in the heart, inflammation of the heart muscle and swelling of the coronary arteries. These symptoms weaken the artery wall and can lead to heart attack and other serious heart complications that require follow-up treatment even after recovery. To avoid such complications it is important to make a diagnosis as early as possible, especially because, once detected, the disease can be cured in a couple of weeks. Unfortunately, no test has yet been designed to detect the disease, and the diagnosis relies on clinical observations and on ruling out other possibilities. There is therefore a real need for a tool capable of detecting Kawasaki disease automatically, to speed up the diagnosis process. One research avenue for designing that kind of test is the use of convolutional neural networks, which have proven very effective at classifying natural images and are increasingly used for medical applications. To detect inflammation, the coronary arteries are monitored with echocardiography. This procedure is cheap, non-invasive, radiation-free and requires no specific preparation of the patient, which makes it very child-friendly. It would therefore be very valuable to exploit these images further during the diagnosis process.
In this work we use those images to develop a system that processes echocardiography videos and identifies frames showing elements of interest for the diagnosis of Kawasaki disease.

(15) 1.2. Objectives. The objective of this master thesis is to take advantage of the good results of Convolutional Neural Networks in computer vision to assist the diagnosis of Kawasaki disease. We expect the analysis of echocardiographies with computer vision to highlight specific structures and features that can help in the detection of Kawasaki disease. To do so, this master thesis focuses on the development of neural networks of different depths designed specifically to classify echocardiography images: first, to detect frames where coronary arteries can be seen, and then to determine whether the left or the right coronary artery is shown in the image. To be able to train supervised deep learning models for classification, we also plan to build a data set from echocardiography files, labelling each frame to describe what can be seen in the image and then selecting only the frames of interest for our project.
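The two-step objective described above (first detect frames with a coronary artery, then decide which artery is shown) can be sketched as a simple cascade. This is only an illustrative sketch: the function names `classify_video`, `toy_has_artery` and `toy_which_artery` are hypothetical, and the toy rules stand in for the trained networks developed later in the thesis.

```python
from typing import Callable, List, Optional, Sequence

# A grayscale frame represented as rows of pixel values (illustrative type).
Frame = Sequence[Sequence[float]]


def classify_video(
    frames: Sequence[Frame],
    has_artery: Callable[[Frame], bool],
    which_artery: Callable[[Frame], str],
) -> List[Optional[str]]:
    """Run the two-stage cascade on every frame of a video.

    Stage 1 keeps only frames where a coronary artery is visible.
    Stage 2 labels the kept frames as "left" or "right".
    Frames rejected by stage 1 are labelled None.
    """
    results: List[Optional[str]] = []
    for frame in frames:
        if has_artery(frame):
            results.append(which_artery(frame))
        else:
            results.append(None)
    return results


# Toy stand-ins for the two classifiers (assumptions, not real models).
def toy_has_artery(frame: Frame) -> bool:
    return sum(sum(row) for row in frame) > 0


def toy_which_artery(frame: Frame) -> str:
    return "left" if frame[0][0] > frame[0][-1] else "right"


frames = [
    [[0, 0], [0, 0]],  # empty frame: rejected by stage 1
    [[2, 1], [0, 0]],  # brighter on the left side
    [[1, 3], [0, 0]],  # brighter on the right side
]
labels = classify_video(frames, toy_has_artery, toy_which_artery)
print(labels)
```

Separating the two stages means the second classifier only ever sees frames that the first one accepted, which is the reason the thesis treats detection and view classification as two distinct problems.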

(16) Chapter 2. State of the art. 2.1. Image Classification with deep learning. Image classification is a very common machine learning task and has been tackled since the very beginning of computer vision in the 1960s, starting with Seymour Papert's Summer Vision Project, which aimed to segment an object from its background. Even though the project was not successful, it marks the beginning of computer vision as a scientific field [7]. Since then, numerous computer vision techniques have been developed to solve the image classification problem. Until the early 1990s, pattern recognition techniques using, for example, decision trees or quadratic discriminant analysis were considered the state of the art for image classification [8]; today Convolutional Neural Networks are the most common solution, and their architecture is still being researched and improved. The first version of a Convolutional Neural Network was introduced in the 1980s by Kunihiko Fukushima with the Neocognitron. This architecture was then improved by Yann LeCun in the late 1990s [9]. However, it is only since 2012 and the success of Alex Krizhevsky's architecture AlexNet at the Large Scale Visual Recognition Challenge that CNNs started gaining popularity, taking advantage of the new computational power and memory capacity of computers. In this chapter we focus on the results concerning image classification, as it is the task we are trying to complete in this master thesis. 2.2. Convolutional Neural Networks. Convolutional Neural Networks (CNNs) are widely used in computer vision for tasks such as image classification and object detection. They are deep, feed-forward artificial neural networks that alternate convolution and pooling layers to extract features from images and assign weights to each feature in order to differentiate one image from another.
Like other neural networks, CNNs are composed of an input layer, an output layer and several hidden layers in between. On the other hand, unlike multilayer perceptron

(17) networks, CNNs take into account the spatial dependency of pixel values by working with matrices and 2D filters. Only the last layers are flattened and fully connected. Figure 2.1 describes the architecture of a CNN adapted for a classification task. Figure 2.1: CNN architecture. The convolution extracts features and the inherent structure of the image by sliding a 2D filter across the matrix of its pixel values and producing channels of lower dimension that carry the extracted information. The filters are designed to capture spatial information in the images, underlying structures and non-linear relations between the parameters. This process converts the pixel matrix into a form that maps the characteristics of the image and that is easier to process, without losing information that is critical for the classification. The pooling layer is used to reduce the computational cost and memory requirements of the network, but also to increase its robustness to pattern translation, as it only keeps "dominant" information. The number of channels stays unchanged but their dimension is reduced by selecting the most important features. A kernel is used to summarize features either by averaging the values inside the kernel (average pooling) or by keeping only the maximum value (max pooling). Max pooling is usually preferred as it performs noise reduction along with dimension reduction, while average pooling only reduces the dimension. Discarding part of the activations also helps to limit over-fitting, as it reduces the number of parameters to learn in the following layers and therefore the chances of learning overly specific patterns. Another technique to reduce over-fitting in deep neural networks is batch normalization. It limits the covariate shift in the distribution of parameters by normalizing the output of each convolutional layer so that it always has the same mean and variance.
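As a small illustration of the convolution and pooling operations described above, the following sketch applies a hand-written filter to a toy image and then pools the result. It is a minimal pure-Python example, not the implementation used in this work; the function names and the toy image are assumptions made for illustration. As in most deep learning libraries, the "convolution" is computed without flipping the kernel.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def pool2d(fmap, size, reduce_fn):
    """Apply `reduce_fn` to non-overlapping size x size windows of `fmap`."""
    return [
        [
            reduce_fn([fmap[i + a][j + b]
                       for a in range(size) for b in range(size)])
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def mean(values):
    return sum(values) / len(values)


# Toy 5x5 image with a bright vertical band in columns 2 and 3.
image = [[0, 0, 1, 1, 0] for _ in range(5)]

# 1x2 horizontal difference filter: responds to vertical edges.
kernel = [[-1, 1]]

feature_map = conv2d(image, kernel)        # 5 rows x 4 columns
max_pooled = pool2d(feature_map, 2, max)   # keeps the strongest response
avg_pooled = pool2d(feature_map, 2, mean)  # averages responses in each window

print(feature_map[0])  # [0, 1, 0, -1]
print(max_pooled)      # [[1, 0], [1, 0]]
print(avg_pooled)      # [[0.5, -0.5], [0.5, -0.5]]
```

In this toy case max pooling keeps the edge response at full strength, while average pooling dilutes it with the neighbouring zeros, which is one way to see why max pooling tends to preserve the "dominant" information.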
Batch normalization stabilizes and speeds up the learning process, as the network does not have to adapt to a new distribution of parameters at each epoch [10]. A convolutional block is the combination of convolution and pooling layers. The first convolutional layers of the network capture low-level features such as colors or edges and, as the number of convolution layers increases, more complex and abstract, higher-level features are captured. The last layers of a CNN are composed of a flattening layer followed by a conventional fully connected network regularized with dropout layers. It learns a non-linear function adapted to the specific task the network has been

(18) designed for. Like max pooling layers, dropout layers help to avoid over-fitting: during training they randomly deactivate a fraction of the units so that the network cannot rely on any single one of them. In our case the last part of the network is designed to classify images [11]. 2.2.1. AlexNet. One of the most basic CNN architectures is called AlexNet [12]; it was first presented during the 2012 Large Scale Visual Recognition Challenge and trained on the ImageNet data set. ImageNet is one of the most challenging data sets for image classification. It is a freely accessible online image data set, famous for being used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), whose goal is to perform object detection or localization and image classification. Most recent innovations in computer vision were tested on the ImageNet data set and entered the ILSVRC. Since 2012, all winners of the challenge have used CNN architectures. The challenge uses the following data:
• 1000 categories
• 1.2 million training images
• 50 000 validation images
• 150 000 testing images
AlexNet is composed of 5 convolutional layers using the ReLU activation function and 3 max pooling layers. The output of the first two convolutional layers is normalized (Figure 2.2). This network won the 2012 ILSVRC for image classification with a top-1 error rate of 37.5% and a top-5 error rate of 17%. Given that the second best result in this competition was obtained with a combination of SIFT+FVs that achieved a top-1 error rate of 45.7% and a top-5 error rate of 25.7%, we can see that AlexNet was a real breakthrough. As of June 2019, Alex Krizhevsky's paper about ImageNet classification with CNNs has been cited more than 40 000 times [13]. It inspired other CNN architectures such as the VGG network or ResNet, which we detail later on. Figure 2.2: AlexNet architecture.
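The top-1 and top-5 error rates quoted above follow a simple rule: a prediction counts as correct under top-k if the true label is among the k classes with the highest scores. The sketch below illustrates that rule on made-up scores for a 3-class problem; the function name `top_k_error` and the data are assumptions for illustration only.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is NOT among the k best-scored classes."""
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Made-up class scores for 4 images and 3 classes.
scores = [
    [0.10, 0.60, 0.30],  # ranking: 1, 2, 0
    [0.50, 0.20, 0.30],  # ranking: 0, 2, 1
    [0.20, 0.30, 0.50],  # ranking: 2, 1, 0
    [0.40, 0.35, 0.25],  # ranking: 0, 1, 2
]
labels = [1, 2, 0, 1]

print(top_k_error(scores, labels, k=1))  # 0.75
print(top_k_error(scores, labels, k=2))  # 0.25
```

The top-k error can never exceed the top-1 error, which is why the top-5 figures reported for ImageNet are always lower than the corresponding top-1 figures.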

(19) 2.2.2. VGG Networks [2]. The VGG network is named after the Visual Geometry Group of Oxford University, which designed it. It was presented in the 2014 ILSVRC to classify images from the ImageNet data set. It is a deep convolutional network that uses convolution layers with 3x3 convolution filters. The VGG architecture is composed of 5 convolutional blocks, each including up to 4 convolution layers and a max pooling layer. Different architectures are available, with between 11 and 19 layers, as detailed in Figure 2.3. In variants that use batch normalization, each convolution layer is followed by a Batch Normalization right before the activation function (the architecture published in [2] predates this technique). The VGG network takes advantage of new GPU computational power to use deep architectures that can capture high-level features while being very uniform. In 2014 it was among the first architectures to reach a top-5 error rate below 10%. The most common forms of the VGG architecture are VGG16 and VGG19 (Figure 2.4). Figure 2.3: VGG architecture [2].

(20) Figure 2.4: VGG error rate performance on 2014 ImageNet image classification challenge [2]. 2.2.3. ResNet [3]. As previously seen, increasing the depth of a neural network allows a better analysis of images and captures more features and details about their structure, theoretically leading to better classification results. However, deep neural networks have sub-optimal performance because of the vanishing gradient problem. During training, the network's weights are updated using back-propagation. This technique uses the gradient descent algorithm, exploiting the chain rule to find out how to optimally reduce the error between the predicted output and the real output of each neuron. Because the chain rule multiplies many factors together, as we go deeper in back-propagation the update coefficient converges toward 0 and the first layers are barely updated. ResNet offers a solution to the vanishing gradient problem by using identity mapping layers, illustrated in Figure 2.5. During back-propagation the error gradient then goes through the identity function and is multiplied by 1. Thanks to this system very deep neural networks can be built: the deepest ResNet has 152 layers, 8 times more than the deepest VGG. Figure 2.5: residual block with identity mapping layer. The architecture of ResNet is inspired by the VGG architecture, using 3x3 convolution filters and stacking residual blocks that are composed of a convolution-ReLU-convolution sequence whose output is then added to the original signal. As the network is very deep, a bottleneck design is used to reduce the complexity, meaning that 1x1 convolutions reduce and then restore the number of channels inside each residual block. However, to use identity mapping, the input and output of the residual block have to be of the same dimension since they are added together before the activation function (see

(21) Figure 2.5). This is the source of many dimension errors when implementing the network. To tackle this issue, we have to artificially change the dimension of one tensor to make the input and the output match. This can be done either by using zero padding to fit the bigger dimension or by performing a linear projection with a 1x1 convolution filter on the input to fit it to the output dimension. Figure 2.6: comparison of ResNet 151 with other 2015's state of the art networks for image classification. [3]. The ResNet model won the 2015 ILSVRC competition in image classification, object detection and localization, with an error rate below 4% in classification (Figure 2.6). It also won the MS COCO 2015 competition for object detection and segmentation. This architecture, illustrated in Figure 2.7, inspired many state-of-the-art networks still used in 2019.
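The two ways of matching dimensions in a residual block, zero padding and 1x1 projection, can be illustrated on the channel vector of a single pixel, where a 1x1 convolution reduces to a matrix-vector product. The sketch below is a simplified illustration under that assumption: the weights are made up, the convolutions inside the block are modelled as 1x1 for brevity, and all function names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(weights, v):
    """1x1 convolution at a single pixel: mixes channels with a weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def zero_pad_shortcut(x, out_channels):
    """Identity shortcut extended with zero channels to reach `out_channels`."""
    return list(x) + [0.0] * (out_channels - len(x))


def residual_block(x, w1, w2, shortcut):
    """Compute ReLU(F(x) + shortcut(x)) with F = conv, ReLU, conv."""
    f = matvec(w2, relu(matvec(w1, x)))
    s = shortcut(x)
    if len(f) != len(s):
        raise ValueError("residual and shortcut branches must have the same size")
    return relu([a + b for a, b in zip(f, s)])


x = [1.0, -2.0]                                   # 2 input channels
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]         # 2 -> 3 channels
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # 3 -> 3 channels
w_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]    # projection shortcut, 2 -> 3

padded = residual_block(x, w1, w2, lambda v: zero_pad_shortcut(v, 3))
projected = residual_block(x, w1, w2, lambda v: matvec(w_proj, v))

print(padded)     # [2.0, 0.0, 0.0]
print(projected)  # [2.0, 0.0, 3.0]
```

Zero padding adds no parameters, whereas the projection shortcut introduces a learnable weight matrix; passing the unmodified 2-channel input as the shortcut would raise the dimension error mentioned above.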

(22) Figure 2.7: ResNet architecture [3].

(23) 2.2.4. EfficientNet [4]. As of June 2019, the best accuracy on the ImageNet data set is achieved by EfficientNet, with 84.4% of correct predictions, surpassing by only 0.1% the GPipe network, which had held the record since December 2018, while being 8.4 times smaller and 6.1 times faster. EfficientNet is a Convolutional Neural Network that uses model scaling to increase the accuracy of already existing networks. Model scaling means changing the dimensions of the model while keeping its original architecture. For example, the reference paper describing the VGG network offers different scales, changing the depth of the network from 11 weight layers up to 19 weight layers (cf. Figure 2.3). Model scaling methods are described in Figure 2.8. Scaling is generally done by either:
• adding more layers to the network to take into account more complex features,
• adding more channels to each layer instead of increasing the depth, which eases the training process and lets the wider output capture more fine-grained features,
• or increasing the resolution of the input images to take into account small patterns and improve accuracy.
Figure 2.8: Model Scaling techniques [4]. The problem with model scaling is that it can require a lot of tuning to adapt the architecture to the new scale. Indeed, in neural networks the optimal width, depth and resolution depend on each other, and changing only one of these parameters can result in sub-optimal accuracy and efficiency. On the other hand, scaling all three dimensions can drastically increase the number of parameters, making the network too expensive to compute. The goal of EfficientNet is to benefit from model scaling while keeping good accuracy and efficiency and keeping the number of parameters under control. EfficientNet scales up all dimensions with fixed scaling coefficients using a compound scaling method. This scaling method has been applied to different well-known CNNs such

(24) as ResNet50 and improved the top-1 accuracy of image classification on the ImageNet data set from 76% to 78.8%. The final architecture of an EfficientNet is described in Figure 2.9, where MBConv denotes a mobile inverted bottleneck convolutional layer. It was developed using Neural Architecture Search, a system that automates the design of neural networks. Figure 2.9: EfficientNet Architecture [5]. 2.3. Medical Images. In 2013, IBM's Watson was one of the first machine learning applications in the health care industry and was used to assist doctors in decision making during lung cancer treatment. Today, research on machine learning applications is carried out in all fields of the health care industry, from cancer diagnosis to the development of new drugs and treatment recommendations. The automatic analysis of digital images and videos is a crucial application of machine learning in the medical field as the need for diagnosis assistance grows. Recent applications of machine learning to medicine involve image classification, object detection and segmentation for the analysis of all kinds of anatomical structures such as the brain, breast, heart, kidney, muscles and more [14]. Both Stanford and Google have developed machine learning algorithms using CNN models to identify, respectively, cancerous skin tissue and diabetic retinopathy in retinal images [15]. Microsoft also works on tumor detection within its InnerEye project, whose first publication was released in 2008. The Stanford model was developed with GoogLeNet Inception v3, which uses a CNN architecture (Figure 2.10). It was pre-trained on the ImageNet database from the 2014 ILSVRC competition to make up for the lack of specific data and to limit the impact of image variability on the performance of the algorithm.
Then it was adapted with transfer learning to the detection of cancerous tissue by removing the last fully connected layer and training on skin lesion images distributed over 757 disease classes. The output of the algorithm is a vector with the probability that the skin lesion image belongs to each class, which is used to classify the image as benign or malignant. The CNN algorithm achieves an accuracy of 72%, while dermatologists had an accuracy of at best 66% [6].
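One plausible way to turn a vector of fine-grained class probabilities into a benign/malignant decision, as described above, is to sum the probabilities of the classes belonging to each group and keep the larger total. The sketch below illustrates this idea only; the four class names, their grouping and the probabilities are invented for the example and do not reproduce the 757 classes or the outputs of the model in [6].

```python
def aggregate(probabilities, groups):
    """Sum fine-grained class probabilities into coarse group probabilities."""
    totals = {}
    for class_name, p in probabilities.items():
        group = groups[class_name]
        totals[group] = totals.get(group, 0.0) + p
    return totals


# Hypothetical fine-grained classes and their coarse group (illustrative only).
groups = {
    "melanoma": "malignant",
    "basal cell carcinoma": "malignant",
    "benign nevus": "benign",
    "seborrheic keratosis": "benign",
}

# Hypothetical network output for one image (probabilities sum to 1).
fine_probs = {
    "melanoma": 0.30,
    "basal cell carcinoma": 0.25,
    "benign nevus": 0.35,
    "seborrheic keratosis": 0.10,
}

coarse = aggregate(fine_probs, groups)
decision = max(coarse, key=coarse.get)
print(coarse, decision)
```

Note that the single most probable fine-grained class here is benign ("benign nevus"), yet the aggregated decision is malignant, which shows why the decision is taken on the summed probabilities rather than on the top class alone.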

(25) 12. Chapter 2. State of the art. Figure 2.10: Architecture of Stanford model for skin cancer detection [6]. Google AI group is also very active in the research of machine learning solution for cancer detection. Their last project using computer vision aims at detecting Diabetic Retinopathy and also used transfer learning with pre-trained GoogLeNet architectures trained on ImageNet data set. The network was adapted to the specific application by removing the last fully connected layer and using the model as a feature extractor. AlexNet and VGG16 architecture were also explored following the same process. Regarding binary classification, GoogLeNet achieved the best performance with a sensitivity of 95% and a specificity of 96%, however, when trying to scale up the network to detect more classes, the network behaved poorly. Although this result can be partially explained by the limitation of CNN capacity to detect very fine features, poor labeling and images artifacts can also be responsible for the results [16]. Concerning echocardiography images, in 2018 a research group trained machine learning algorithm to automatically perform basic tasks of echocardiography interpretation such as identify different heart views and segment cardiac chambers. Such tool would reduce the cost of echocardiography exam and allow faster and more accurate interpretation of images, extending it use to more rural area were expert are not always available. Accurately classifying echocardiography views is also a first step to gain the confidence of medical community and toward more developed tool to assist diagnosis. Indeed, if an algorithm can learn to differentiate heart views, it can probably also learn what is normal or not with those views. The view classification was done training different CNN algorithms initialized with random weights and using transfer learning from ResNet50 and VGG16 model pre-trained on ImageNet dataset. 
The best results were obtained with a single U-Net24 composed of 3 CNNs, achieving an accuracy of 94.4% on the test set. The images were segmented before entering the network in order to extract only the echocardiography area. Both the ResNet50 and VGG16 models achieved a lower overall test accuracy, 91.36% and 83.67% respectively, despite having deeper and more complex architectures. This study also used semi-supervised Generative Adversarial Networks to develop a classification algorithm from a large data set with few labelled samples, achieving over 80% correct classification with only 4% of the data set labelled [17].

Beyond classifying medical images to detect a defined structure, the next step is to train algorithms not only to detect images containing a given pattern but also to point out where it is in the image. This technology is used, for example, in the InnerEye Project.

In that project, image segmentation is used to find out precisely where a brain tumor is located, in order to direct the radiation onto the injured tissue and leave the rest of the brain untouched. Currently, brain tumor segmentation is done manually from slices of a 3D brain scan, which is very time consuming. Automatic segmentation would improve the precision of the segmentation and free up time for doctors to focus on the best way to treat the patient. Indeed, the main goal of applying machine learning techniques to the medical field is not to replace doctors with automatic tools, but to gain time and precision by using the latest technologies to assist doctors during diagnosis and treatment, so that they have more time to find out what is best for their patients and make the practice of medicine more human.

Chapter 3. Building the data set

3.1. Access to echo-cardiograms

One of the main problems with machine learning today is collecting enough data to feed the neural network so that it can actually learn and recognize the structure of the data. Machine learning applied to health care is especially affected by this issue, and accessing a usable data set of medical images can be challenging. Medical data are not easily accessible because they are considered "personal data"; under this status they are protected and cannot be distributed without the patient's agreement. On the other hand, transforming the data to remove any personal information about the patient could make them less useful, as the metadata are often needed for more advanced processing. In our case, even though we only use the echocardiography images, it is particularly difficult to collect data because the disease affects very young children. This is a sensitive population to work with, as children cannot give consent themselves and are classified as high ethical risk by the 2018 European Union report on ethics and data protection [18]. Moreover, Kawasaki disease is not systematically diagnosed with echocardiography: this exam is mostly used to monitor the evolution of coronary inflammation after the diagnosis has been made. As a consequence, it is very difficult to find data for our problem on the internet, so the only solution was to build the data set from scratch. For this thesis, we worked with the pediatric cardiology department of Madrid's 12 de Octubre hospital, which provided us with echocardiography images and helped us during the labelling process with their expertise on heart images.

3.2. Format of the data

The echocardiographies are provided as DICOM (Digital Imaging and Communications in Medicine) files, the standard format used to handle medical imaging. This format standardizes the way medical images are stored and keeps demographic information about the patient as well as details of the procedure used. It is convenient to work with because each file is independent, and it is easy to identify its origin and to trace back the patient and the acquisition parameters. For this study we are only interested in three kinds of fields:
• the video, represented by an array of arrays of frame pixel values, saved in the field "pixel array";
• the dimensions of the image, saved in the fields "Rows" and "Columns";
• the scale of the image, determined by the real-life distance covered by a pixel, saved in the field "Pixel Spacing" or in "PhysicalDeltaX" and "PhysicalDeltaY". This information will be useful later on to measure the diameter of a coronary artery.

Since we are dealing with echocardiography videos, the first step was to extract the video from the DICOM file and convert it into png images stored in a folder of frames. Some frames carry additional information written in color, so each frame is composed of three layers; since the part representing the coronary artery is in grey scale, we only use one layer. Once we have the frames, we can start building the data set and label the images according to which coronary artery is visible. An echocardiography can capture different planes of the heart. For this study we want to monitor the output of the two main coronary arteries, so we use echocardiographies representing plane B, as described in Figure 3.2. In that scheme, the left artery is circled in green and the right artery is circled in red. We use echocardiographies of the parasternal short-axis plane, all acquired with the same device and with the angle described in Figure 3.1.

Figure 3.1: Acquisition of echocardiography.

Figure 3.2: Heart planes captured with an echo-cardiogram.

Figure 3.3 shows an example of the left (circled in red) and right (circled in green) coronary arteries. We are going to look for this kind of pattern in the frames to identify coronary arteries.

(a) right coronary artery. (b) left coronary artery.
Figure 3.3: Examples used to find coronary arteries.

For this work, we consider five categories:
1. Files that do not represent the area of interest for our study, and where we therefore cannot see coronary arteries. Some examples can be seen in Figure 3.5 (a), (b) and (d).
2. Files that represent the correct area but do not feature any coronary artery. An example can be seen in Figure 3.5 (c).
3. Frames where we can see both coronary arteries. Some examples can be seen in Figure 3.4 (a) and (b).
4. Frames where we can only see the right coronary artery. Some examples can be seen in Figure 3.4 (c) and (d).
5. Frames where we can only see the left coronary artery. Some examples can be seen in Figure 3.4 (e) and (f).

(a) view of a left coronary artery. (b) view of a left coronary artery. (c) view of a right coronary artery. (d) view of a right coronary artery. (e) view with right and left coronary arteries. (f) view with right and left coronary arteries.
Figure 3.4: Example of echocardiography frames with artery view.

(a) echocardiography not from plane B. (b) echocardiography not from plane B. (c) echocardiography from plane B. (d) echocardiography not from plane B.
Figure 3.5: Example of echocardiography frames without any coronary artery view.

We classify together all frames that do not feature coronary arteries, no matter which part of the heart they represent. For example, all images of Figure 3.5 are classified as NONE.

3.3. Design of the data set

The labelling was done manually, scanning each DICOM file frame by frame and deciding to which class each image belongs. It was done using an interface developed in Python and designed to make scanning and classifying all frames easier, using only the keyboard. The program takes as input the path to a DICOM file and returns a mask in the form of a csv file, together with png images sorted into folders. When the program is launched, the user can navigate through the video with the keyboard arrows to display the previous or next frame. If the displayed frame shows a coronary artery, the user can label it by pressing the ’L’, ’R’ or ’B’ key to classify it as ’LEFT’, ’RIGHT’ or ’BOTH’ respectively. When a frame is labelled, a colored circle appears at the bottom left of the image: green when the image is classified as RIGHT, blue when it is classified as BOTH and red when it is classified as LEFT. The classification can be deleted at any moment by pressing the ’N’ key and selecting another classification. When the user considers the labelling done, he or she exits the video by pressing the escape key. A mask with the labels is then created, and the frames are extracted from the "pixel array" field, converted to png images and distributed into folders according to their labels. This folder distribution, illustrated in Figure 3.6, is used as the label during the training process. The mask is a list of numbers whose length equals the number of frames in the echocardiography video. Each frame is assigned a number between 0 and 3 that defines its class: 0 for frames showing both arteries, 1 for the left artery, 2 for frames that do not show any artery and 3 for the right artery.

Figure 3.6: Folder hierarchy for labelled images.

The first problem encountered during the labelling process was the blurriness of the echocardiography images: the arteries do not always appear clearly and entirely. Because the image is blurry, the artery can appear more or less clearly depending on the size chosen to display the image. To make the classification as reliable as possible, the process was repeated three times. Images that received the same label in all three passes were automatically assigned that label; images with differing labels were re-examined and given the most frequent label. Then, to make sure the classification was correct, the images were reviewed by the cardiology team of the 12 de Octubre hospital. At the end of the classification we obtain the data distribution described in Figure 3.7.

Figure 3.7: Distribution of classes in each DICOM file.
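The merging of the three labelling passes and the numeric mask can be sketched in a few lines of Python. This is only an illustration: the function names `merge_passes` and `to_mask` are hypothetical and are not taken from the labelling tool; only the class codes (0: BOTH, 1: LEFT, 2: NONE, 3: RIGHT) come from the text. How a three-way disagreement was resolved is not stated, so the sketch simply flags every non-unanimous frame for review.

```python
from collections import Counter

# Class codes as given in the text; everything else is illustrative.
CLASS_CODES = {"BOTH": 0, "LEFT": 1, "NONE": 2, "RIGHT": 3}


def merge_passes(pass_a, pass_b, pass_c):
    """Merge three labelling passes frame by frame.

    Returns the merged labels and the indices of the frames whose three
    labels were not unanimous (the ones to re-examine).
    """
    merged, to_review = [], []
    for index, labels in enumerate(zip(pass_a, pass_b, pass_c)):
        label, votes = Counter(labels).most_common(1)[0]
        merged.append(label)  # most frequent label (first seen on a tie)
        if votes < 3:
            to_review.append(index)
    return merged, to_review


def to_mask(labels):
    """Convert label names to the numeric mask (one number per frame)."""
    return [CLASS_CODES[label] for label in labels]


# Toy example with four frames (made-up labels).
pass_a = ["NONE", "LEFT", "BOTH", "RIGHT"]
pass_b = ["NONE", "LEFT", "LEFT", "RIGHT"]
pass_c = ["NONE", "BOTH", "LEFT", "RIGHT"]
merged, to_review = merge_passes(pass_a, pass_b, pass_c)
mask = to_mask(merged)
print(merged, to_review, mask)
```

In this toy example, frames 1 and 2 are flagged for a second look, while frames 0 and 3 keep their unanimous label.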

Moreover, the number of echocardiography frames without coronary arteries is far larger than the number of frames with one or two coronary arteries. To have approximately the same number of images with and without an artery, we randomly select 1748 images without artery, 46 in each of the 38 files. This number is obtained by dividing the total number of images in the three other classes by the number of DICOM files and rounding down:

\[ 973 + 263 + 535 = 1771, \qquad \left\lfloor \frac{1771}{38} \right\rfloor = 46, \qquad 46 \times 38 = 1748 \]

The final distribution of data between classes is shown in Figure 3.8 below and detailed in Table 3.1.

Figure 3.8: Distribution of classes.

Once the data set is ready, we can start training neural networks to detect frames with coronary arteries.
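The balancing arithmetic above can be checked with a short Python sketch. The class totals and the number of files come from the text; the helper `sample_none_frames`, its fixed seed and the list of 300 frame indices are assumptions made for illustration, since the thesis does not describe how the random selection was implemented.

```python
import random

# Totals of the three artery classes and number of DICOM files (from the text).
artery_counts = {"LEFT": 973, "RIGHT": 263, "BOTH": 535}
n_files = 38

total_artery = sum(artery_counts.values())   # frames showing an artery
per_file = total_artery // n_files           # NONE frames kept per file
total_none = per_file * n_files              # NONE frames kept overall


def sample_none_frames(none_frame_indices, quota, seed=0):
    """Randomly keep at most `quota` NONE frames from one file (hypothetical helper)."""
    rng = random.Random(seed)
    quota = min(quota, len(none_frame_indices))
    return sorted(rng.sample(none_frame_indices, quota))


# Toy file with 300 frames without artery.
kept = sample_none_frames(list(range(300)), per_file)
print(total_artery, per_file, total_none, len(kept))
```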

File                        View   LEFT  RIGHT   BOTH   NONE  TOTAL
IM 0017                        2      0      0      0     46
IM 0018                        1     17      0      8     46
IM 0019                        1     52     18     17     46
IM 0020                        1     66     18      4     46
IM 0021                        1     53     17     23     46
IM 0022                        1     37     10     52     46
IM 0023                        1     53     16     21     46
IM 0024                        2      0      0      0     46
IM 0026                        2      0      0      0     46
IM 0027                        1      9      0     16     46
IM 0028                        1     45      9     24     46
IM 0029                        1     56     14      4     46
IM 0030                        1     44     10     37     46
IM 0031                        1     42     11     49     46
IM 0032                        1     31     10     27     46
IM 0033                        2      0      0      0     46
IM 0034                        2      0      0      0     46
IM 0035                        2      0      0      0     46
IM 0036                        2      0      0      0     46
IM 0037                        2      0      0      0     46
IM 0054                        2      0      0      0     46
IM 0055                        1      5      0     18     46
IM 0056                        1     41      6     23     46
IM 0057                        1     59     13      4     46
IM 0058                        1     49     10     26     46
IM 0059                        1     40      9     55     46
IM 0060                        1     40     26     16     46
IM 0061                        2      0      0      0     46
IM 0062                        2      0      0      0     46
IM 0063                        2      0      0      0     46
IM 0064                        2      0      0      0     46
IM 0065                        1     29      0      6     46
IM 0082                        2      0      0      0     46
IM 0083                        1     11      1     12     46
IM 0084                        1     41     15     21     46
IM 0085                        1     59     18      6     46
IM 0086                        1     50     17     25     46
IM 0087                        1     44     15     41     46
Number of training data             781    219    432   1380   2812
Number of testing data              192     44    103    368    707
Total                               973    263    535   1748   3519

Table 3.1: Summary of the data set.

Chapter 4. Development of the system

4.1. Description of the system

The training process is split into two steps. First, we train networks to detect whether an image features an artery. Once we obtain a stable system, we adapt the networks to classify the images according to which coronary artery can be seen. The objective of this classification is to later select automatically, from an echocardiography video, the images featuring a coronary artery. The user gives the system a DICOM file containing an echocardiography video; the system extracts the frames from the DICOM file and classifies each of them with a classification algorithm developed in this project. The system then distributes the classified frames into folders that the user can access at the output. The design of the system is illustrated in Figure 4.1.

Figure 4.1: Organization of the final system classifying into two classes.

To extract the frames from the DICOM file we use the Python package "pydicom", which reads DICOM files and gives access to all their fields. As explained in Section 3.2, we only use three kinds of fields. The field "pixel array" contains the frames, stored as an array of arrays of pixel values; it allows us to extract the frames from the DICOM files and save them as png images. We also extract the dimensions of the images from the fields "Rows" and "Columns", and the scale of the image from the field "Pixel Spacing" or from "PhysicalDeltaX" and "PhysicalDeltaY". Even though the scale is not necessary for the classification task, it must be kept, as the final goal is to measure the width of the coronary arteries in the pictures to determine whether the patient is ill.
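As a rough illustration of this extraction step, the sketch below reads the three kinds of fields with pydicom. It is not the thesis code: the function names `read_echo` and `pixels_to_length` are hypothetical, the file is assumed to be a multi-frame video, and the sketch assumes that the physical deltas are stored in the first item of the ultrasound region sequence. Units are not converted, and writing the png files is left out because the image library used is not stated.

```python
def read_echo(path):
    """Return (frames, (rows, cols), scale) for one echocardiography file.

    pydicom (and NumPy, which `pixel_array` relies on) are third-party
    packages; they are imported lazily so the helper below stays usable
    without them.
    """
    import pydicom

    ds = pydicom.dcmread(path)
    frames = ds.pixel_array            # (frames, rows, cols) or (frames, rows, cols, 3)
    if frames.ndim == 4:               # color layers present: keep a single layer
        frames = frames[..., 0]
    rows, cols = int(ds.Rows), int(ds.Columns)

    # Scale: "Pixel Spacing" when present, otherwise the physical deltas of
    # the first ultrasound region (assumption). Order is (vertical, horizontal).
    if "PixelSpacing" in ds:
        scale = tuple(float(v) for v in ds.PixelSpacing)
    elif "SequenceOfUltrasoundRegions" in ds:
        region = ds.SequenceOfUltrasoundRegions[0]
        scale = (float(region.PhysicalDeltaY), float(region.PhysicalDeltaX))
    else:
        scale = None
    return frames, (rows, cols), scale


def pixels_to_length(n_pixels, length_per_pixel):
    """Convert a distance in pixels to a physical length, in the unit of the scale."""
    return n_pixels * length_per_pixel


# Example: an artery 12 pixels wide with a scale of 0.25 (unit of the file) per pixel.
width = pixels_to_length(12, 0.25)
print(width)
```

The second helper is the part that matters for the final goal: once an artery is located, its width in pixels is multiplied by the scale read from the file.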

To classify the frames, we use Convolutional Neural Networks, starting from a simple model to test the viability of the system and evolving toward a deeper and more complex one, in order to capture as many features from the images as possible and perform a good classification.

4.2. Detection of images with coronary arteries

For this task we use the echocardiography data set described previously. It has been designed so that the two classes are balanced in number of frames, and so that the test batch has the same proportion of each kind of frame as the training set.

4.2.1. Basic Convolutional Neural Networks

To start experimenting with classification systems, we used a basic CNN composed of three convolutional layers (Convolution-Activation-MaxPooling). The model ends with fully connected layers regularized with a Dropout layer, as described in Figure 4.2. The output is reduced to one neuron and the classification is done by thresholding this output at 0.5. The training process lasts 50 epochs, using an RMSprop optimizer and a learning rate of 1e-4.

Figure 4.2: First basic CNN model architecture.

This network was very unstable: after training 10 models with the same architecture, only about half of them converged and learned from the data within 50 epochs. Even among the converging models, it sometimes took tens of epochs before the accuracy and the loss function started evolving in the right direction. Figure 4.3 shows the evolution of accuracy during three training attempts of this basic CNN model.

(a) basic 3 layers CNN not learning. (b) basic 3 layers CNN starting to learn after 20 epochs. (c) basic 3 layers CNN learning correctly.
Figure 4.3: Examples of learning behavior of the CNN with 3 layers.

The random results could come from the weight initialization, when it is too far from the optimal weights. Since the accuracy is quite good whenever the system does learn, we keep a similar architecture. The first attempt to obtain more stable results was to crop the images around the area of interest (Figure 4.4). This solution has the double advantage of reducing the dimension of the network, and therefore the memory issues, and of focusing directly on the area of interest without processing useless pixels. The images are cropped around the area of interest, going from 800x600 pixels to 500x320 pixels, which reduces the number of pixels per image by a factor of 3. This should make the model more efficient and reduce the computational cost of each operation. Since we plan to implement deep neural networks, optimizing the computational cost of the network will be important.

(a) full frame. (b) cropped frame.
Figure 4.4: Comparison between a full frame and a cropped frame.

The model using cropped images is also trained for 50 epochs, with an RMSprop optimizer and a learning rate of 1e-4. It gave slightly better results: out of 10 training runs, 8 converged in less than 10 epochs. To make the system even more stable, we add more layers to the network. Keeping the same architecture as in the previous basic CNN (Figure 4.2), we add two convolutional layers to obtain the model described in Figure 4.5.

Figure 4.5: CNN model with 5 convolutional layers.
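A possible Keras formulation of such a network is sketched below. Only the overall structure (Convolution-Activation-MaxPooling blocks, fully connected layers with Dropout, one output neuron, RMSprop with a learning rate of 1e-4) comes from the text. The filter counts, kernel size, dense layer size, dropout rate and the grey-scale input shape of 320 rows by 500 columns are assumptions chosen for illustration. The small helper `feature_map_size` shows why cropping helps: it computes how many spatial positions reach the fully connected layers.

```python
def feature_map_size(height, width, n_blocks):
    """Spatial size after n_blocks of 3x3 'valid' convolution + 2x2 max pooling."""
    for _ in range(n_blocks):
        height, width = (height - 2) // 2, (width - 2) // 2
    return height, width


def build_cnn(input_shape=(320, 500, 1), n_blocks=5):
    """Build a small binary CNN (hyper-parameters other than the optimizer are assumptions)."""
    import tensorflow as tf  # third-party, imported lazily
    from tensorflow.keras import layers

    model = tf.keras.Sequential([tf.keras.Input(shape=input_shape)])
    for block in range(n_blocks):
        filters = min(32 * 2 ** block, 128)          # 32, 64, 128, 128, ...
        model.add(layers.Conv2D(filters, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation="sigmoid"))  # thresholded at 0.5 for the class
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


full_3 = feature_map_size(600, 800, 3)      # full frame, 3 blocks
cropped_3 = feature_map_size(320, 500, 3)   # cropped frame, 3 blocks
cropped_5 = feature_map_size(320, 500, 5)   # cropped frame, 5 blocks
print(full_3, cropped_3, cropped_5)
```

Under these assumptions, both cropping and the two extra blocks shrink the map that is flattened before the dense layers, which is where most of the parameters of such a model sit.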

Keeping the same parameters as before, we train the model for 50 epochs using an RMSprop optimizer and a learning rate of 1e-4. The results are shown in Figure 4.6.

(a) evolution of accuracy. (b) evolution of the loss function.
Figure 4.6: 5 layers CNN model after 50 epochs, RMSprop optimizer and 1e-4 learning rate.

0: no artery, 1: visible arteries
Figure 4.7: Confusion matrix of the 5 layers CNN after 50 epochs using an RMSprop optimizer and 1e-4 learning rate.

This time the network converges systematically. On the training data set the final accuracy is around 91% and the loss converges toward 0.2. Applied to the test data set, this model gives an accuracy of 92%. We can see in Figure 4.7 that the model detects the frames featuring at least one coronary artery more accurately than the frames with nothing. This is interesting for our model since the final objective is to extract frames with coronary arteries: here, 96% of the frames extracted will be exploitable. To try to improve the model, we performed some tests with other classic optimizers. First, we train the network with an SGD optimizer, still for 50 epochs with a 1e-4 learning rate.

(a) evolution of accuracy. (b) evolution of the loss function.
Figure 4.8: Result of the 5 layers CNN model after 50 epochs using an SGD optimizer and 1e-4 learning rate.

0: no artery, 1: visible arteries
Figure 4.9: Confusion matrix of the 5 layers CNN model after 50 epochs using an SGD optimizer and 1e-4 learning rate.

This configuration does not give better results. The accuracy on the validation data converges toward 85%, whereas it was closer to 90% in the previous model, and the validation loss saturates a little below 0.4, whereas it converged to 0.3 with the RMSprop optimizer (see Figure 4.8 and Figure 4.9). The final accuracy on the test data set is also worse, around 89% against 92% previously. The main difference is that more mistakes are made on the prediction of class 0. We then try the Adam optimizer, which is the one recommended by Keras for the implementation of a CNN model. We train the model for 50 epochs with a 1e-4 learning rate and obtain the accuracy and loss curves in Figure 4.10.

(a) evolution of accuracy. (b) evolution of the loss function.
Figure 4.10: Result of the 5 layers CNN model after 50 epochs using an Adam optimizer and 1e-4 learning rate.

0: no artery, 1: visible arteries
Figure 4.11: Confusion matrix of the 5 layers CNN model after 50 epochs using an Adam optimizer and 1e-4 learning rate.

The confusion matrix in Figure 4.11 shows that the result is unchanged regarding the amount of correct labelling for class 0, but is a little better for the prediction of class 1. Since the accuracy and the loss function do not seem to have fully converged (Figure 4.10), we train the same model (Adam optimizer and 1e-4 learning rate) for 100 epochs and obtain the accuracy and loss curves of Figure 4.12.

(a) evolution of accuracy. (b) evolution of the loss function.
Figure 4.12: Result of the 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate.

0: no artery, 1: visible arteries
Figure 4.13: Confusion matrix of the 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate.

After 100 epochs with the Adam optimizer, the overall accuracy on the test batch reaches 93% (Table 4.1), which is the best score obtained so far. The confusion matrix in Figure 4.13 shows that, even though the detection of frames featuring a coronary artery (class 1) did not improve compared with the previous model and is even slightly worse, the classification of images from class 0 is more accurate, and both classes are correctly classified 90% of the time or more. In the end, we lose some accuracy on class 1 but gain in balance regarding the overall performance of the algorithm.
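The quantities discussed with the confusion matrices (overall accuracy, share of each class correctly classified, share of extracted frames that really show an artery) can be computed as in the sketch below. The counts in `example` are made-up round numbers used only to show the arithmetic; they are not the values of Figures 4.7 to 4.13, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(confusion):
    """Metrics from a 2x2 confusion matrix.

    confusion[i][j] is the number of frames of true class i predicted as
    class j (0: no artery, 1: visible arteries).
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,       # all frames correctly classified
        "recall_0": tn / (tn + fp),          # share of class 0 frames detected
        "recall_1": tp / (tp + fn),          # share of class 1 frames detected
        "precision_1": tp / (tp + fp),       # share of extracted frames that are class 1
    }


# Illustrative counts only (100 frames per class).
example = [[90, 10],
           [4, 96]]
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The example also shows that detecting 96% of the artery frames (recall) is not the same as 96% of the extracted frames being usable (precision); the two only coincide when the errors on both classes balance out.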

The effect of the different optimizers on the accuracy is summarized in Table 4.1.

Optimizer   Learning rate   Number of epochs   Test batch overall accuracy
RMSprop     1e-4            50                 0.92
SGD         1e-4            50                 0.89
Adam        1e-4            50                 0.89
Adam        1e-4            100                0.93

Table 4.1: Summary of optimizer performance.

To conclude, we have seen in this section that, to obtain a model with a stable learning capacity, we had to go from a model with three layers to a slightly deeper model with five layers. We also found that reducing the size of the images gave a more stable and reliable model while reducing the memory and computational power needed. To design an optimal 5 layers CNN model, we experimented with three of the most common optimizers and identified Adam as giving the best result for our application, achieving an overall accuracy of 93% on the test batch and, more specifically, detecting 96% of the images featuring a coronary artery.

4.2.2. VGG

Here we implement the VGG16 network. This form was preferred to the more common VGG19 because, during the development of the networks, we had to deal with memory limitations and therefore tried to reduce the computational cost of the algorithm as much as possible. Reducing the size of the images was a first step to solve this issue; using small batches of 5 frames and reducing the number of parameters in the model also helped. Since we only needed to distribute the images between 2 classes, we also drastically reduced the number of neurons in the last fully connected layers.

Figure 4.14: Architecture of our VGG16 model.

We keep the same hyperparameters as before to train the model, using the optimizer with the best performance. The model is therefore trained for 50 epochs using an Adam optimizer and a 1e-4 learning rate.

(a) training data set. (b) validation data set.
Figure 4.15: Evolution of accuracy of the VGG16 model during 50 epochs using an Adam optimizer and a 1e-4 learning rate.

(a) training data set. (b) validation data set.
Figure 4.16: Evolution of loss of the VGG16 model during 50 epochs using an Adam optimizer and a 1e-4 learning rate.

0: no artery, 1: visible arteries
Figure 4.17: Confusion matrix of the VGG16 model after 50 epochs using an Adam optimizer and a 1e-4 learning rate.

The results of the VGG16 model (Figures 4.15 and 4.16) are very similar to those of the previous models. We achieve an overall accuracy of 90% on the test batch and obtain the best classification performance on class 1 (Figure 4.17). Here, adding more layers does not improve the quality of the classification, maybe because the images are quite standardized and the structure of the pattern we are looking for is not complex. What makes the classification difficult is rather the low quality of some images and the blurry border between an image where we can clearly see a coronary artery and an image where we can only guess it without clearly identifying its shape.

4.3. Classification of coronary artery views

The first step was to design a model performing a binary classification, to detect frames with coronary arteries. Now that we have found several stable models that fulfill this objective, we extend them to distribute the frames into four classes according to which coronary arteries can be seen in the image. We use the same architectures and modify the last layers so that the output is a vector of length 4 giving the probability that the image belongs to each class. In the previous problem, the data set was more or less equally divided between the two classes. We now distinguish four kinds of view: "BOTH" for the images showing both coronary arteries, "LEFT" for those showing only the left coronary artery, "NONE" for those showing no coronary artery and "RIGHT" for those showing only the right coronary artery. This distribution is not balanced: the least represented class has about six times fewer images than the most represented one. One common solution to reduce the impact of this imbalance is data augmentation. However, this process cannot easily be applied to medical images, and especially not to this application: the images cannot be rotated or flipped, since the echocardiographies are always captured with the same orientation. The solution chosen here is to apply a weight to each class during the training process. For example, each instance of class "RIGHT" is given six times the importance of an instance of class "NONE".

4.3.1. Basic Convolutional Neural Networks

We start by adapting the 5 layers CNN to this classification, again testing several optimizers.
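The class weights mentioned above can be derived from the training counts of Table 3.1, for example as sketched below. The exact formula used in the thesis is not stated; weighting each class by the ratio between the largest class and its own size is an assumption that happens to reproduce the factor of about six between "RIGHT" and "NONE". The function name `class_weights` is hypothetical.

```python
# Training frames per class (Table 3.1) and class codes used for the labels.
train_counts = {"BOTH": 432, "LEFT": 781, "NONE": 1380, "RIGHT": 219}
CLASS_CODES = {"BOTH": 0, "LEFT": 1, "NONE": 2, "RIGHT": 3}


def class_weights(counts, codes):
    """Weight each class by (size of largest class) / (size of this class)."""
    largest = max(counts.values())
    return {codes[name]: largest / n for name, n in counts.items()}


weights = class_weights(train_counts, CLASS_CODES)
for code, weight in sorted(weights.items()):
    print(code, round(weight, 2))
```

A dictionary of this form, mapping the class index to its weight, is what the `class_weight` argument of the Keras `fit` method accepts.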
The first model is trained for 50 epochs using the RMSprop optimizer with a learning rate of 1e-4. The accuracy and loss curves are shown in Figure 4.18.

(a) evolution of accuracy. (b) evolution of the loss function.
Figure 4.18: Result of the 5 layers CNN model after 50 epochs using an RMSprop optimizer and 1e-4 learning rate to classify 4 classes.

0: BOTH, 1: LEFT, 2: NONE, 3: RIGHT
Figure 4.19: Confusion matrix of the 5 layers CNN model after 50 epochs using an RMSprop optimizer and 1e-4 learning rate to classify 4 classes.

Applied to the test set, this model gives an overall accuracy of 86%. As expected, the overall accuracy is lower than for the previous binary classification with the same model. We can see in the confusion matrix of Figure 4.19 that the class with the best accuracy is "NONE" and the class with the worst accuracy is "RIGHT". The differences in performance between classes are explained by the imbalance of the data set, as the accuracy values follow the number of images in each class. We also applied the 5 layers CNN model with the Adam optimizer to this new classification task, directly using 100 epochs and a 1e-4 learning rate. The results are shown below in Figure 4.20.

(a) evolution of accuracy. (b) evolution of the loss function.
Figure 4.20: Result of the 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes.

0: BOTH, 1: LEFT, 2: NONE, 3: RIGHT
Figure 4.21: Confusion matrix of the 5 layers CNN model after 100 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes.

This classification gives especially good results on the test batch, with an overall accuracy of 95%, and classifies the images without coronary arteries (class NONE) almost perfectly. Even the least represented class, "RIGHT", has an accuracy of 85%. The results are even better than for the binary classification task.

4.3.2. VGG

We also apply the VGG16 model to this classification task to see whether a deeper neural network can improve the results. We train the model for 50 epochs using an Adam optimizer and a 1e-4 learning rate. As the model takes several hours to train completely, we save the model with the best loss value; these are the results presented below in the confusion matrix. For the same reason, the training was done in two sessions. The first 40 epochs are shown in pink in the graphs below. The second session ran for 27 epochs, starting from the weights of the 23rd epoch, which had the lowest validation loss. Since the weights obtained after the 23rd epoch were discarded, the second session had to last 27 epochs (50 - 23), and not 10, to reach a total of 50 effective epochs.

(a) training data set. (b) validation data set.
Figure 4.22: Evolution of accuracy of the VGG16 model during 50 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes.

(a) training data set. (b) validation data set.
Figure 4.23: Evolution of loss of the VGG16 model during 50 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes.
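The bookkeeping of the two VGG16 training sessions can be written down as a short sketch. The validation losses below are synthetic values built so that their minimum falls on the 23rd epoch, as reported in the text; they are not the measured curve, and the function name `resume_plan` is hypothetical. In Keras, keeping the best weights is commonly done with the `ModelCheckpoint` callback (`monitor="val_loss"`, `save_best_only=True`); whether the thesis used it is not stated.

```python
def resume_plan(val_losses, target_epochs):
    """Return (best_epoch, remaining_epochs) for a run resumed from its best epoch.

    Epochs are numbered from 1. The weights of the epoch with the lowest
    validation loss are kept; later epochs are discarded and do not count
    toward the target.
    """
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    best_epoch = best_index + 1
    return best_epoch, target_epochs - best_epoch


# Synthetic validation loss of a 40-epoch first session, lowest at epoch 23.
first_session = [0.30 + 0.01 * abs(epoch - 23) for epoch in range(1, 41)]
best_epoch, remaining = resume_plan(first_session, target_epochs=50)
print(best_epoch, remaining)
```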

0: BOTH, 1: LEFT, 2: NONE, 3: RIGHT
Figure 4.24: Confusion matrix of the VGG16 model after 68 epochs using an Adam optimizer and 1e-4 learning rate to classify 4 classes.

Finally, even though the results presented in Figures 4.22 and 4.23 show that the system is learning, we can see in the confusion matrix of Figure 4.24 that this model does not improve the classification accuracy and even gives worse results than the 5 layers CNN with an RMSprop optimizer in every class but LEFT. Moreover, the 5 layers CNN with an Adam optimizer gives strictly better results for a much shorter training time. One possible reason for this underperformance of the VGG16 model is the use of Batch Normalization after each convolution layer. Indeed, to solve memory issues we use batches of 5 images, and normalizing over such small batches introduces bias, especially if the images in the batch are correlated. Batch Normalization rescales the activations so that their mean and standard deviation over the mini-batch are always the same; if the mini-batch has a mean and standard deviation too different from those of the rest of the data set, the normalization introduces bias.

4.3.3. ResNet50

We trained a ResNet50 model for 50 epochs using the Keras implementation of ResNet, an Adam optimizer and a learning rate of 1e-5. With a higher learning rate, the model did not learn anything and classified every image in the same class. Since the training is long, it had to be split into three sessions, each starting from the last weights of the previous one. The first session, shown in grey on the graphs, lasted 12 epochs; the second, in blue, lasted 21 epochs; and the last, in green, lasted 19 epochs. The graphs have been slightly smoothed in TensorBoard to give a better view of each curve.

(a) training data set. (b) validation data set.
Figure 4.25: Evolution of accuracy of the ResNet50 model during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes.

(a) training data set. (b) validation data set.
Figure 4.26: Evolution of loss of the ResNet50 model during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes.

0: BOTH, 1: LEFT, 2: NONE, 3: RIGHT
Figure 4.27: Confusion matrix of the ResNet50 model after 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes.

Both the training accuracy (Figure 4.25) and the training loss (Figure 4.26) curves are stable. The validation results are more chaotic, but the model still seems to be learning. Using this model to classify images from the test batch, we get an overall accuracy of 77%. This result, although below the best validation accuracy, remains quite consistent. However, the confusion matrix in Figure 4.27 shows that the model completely ignored the RIGHT class and classifies the BOTH class rather badly. The two other classes obtain rather good results, with an accuracy of 90% or more. This can be explained by the imbalance of the data set.

4.3.4. ResNet50 with transfer learning

As seen in the state of the art, since building a data set with enough images can be hard when working with medical images, transfer learning is common in medical applications of machine learning. Since ResNet50 gave mixed but promising results on the dominant classes, we used transfer learning to improve our results. To do so, we still use the Keras implementation of ResNet50, but we start from weights pre-trained on the ImageNet data set. ImageNet uses color images, so we repeat the grey-scale images three times to obtain three-channel tensors.
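The two ingredients of this section, repeating the grey-scale layer three times and starting from ImageNet weights, might look as follows with NumPy and Keras. This is a sketch under assumptions: the function names are hypothetical, the input shape of 320 by 500 pixels and the single softmax layer placed on top of the pooled features are choices made for illustration, and the text does not say whether any layers were frozen or which input preprocessing was applied.

```python
import numpy as np


def to_three_channels(gray_batch):
    """Turn a (N, H, W) grey-scale batch into (N, H, W, 3) by repeating the layer."""
    return np.repeat(gray_batch[..., np.newaxis], 3, axis=-1)


def build_resnet50_transfer(input_shape=(320, 500, 3), n_classes=4):
    """ResNet50 initialized with ImageNet weights and a new 4-class output (assumed head)."""
    import tensorflow as tf  # third-party, imported lazily

    base = tf.keras.applications.ResNet50(
        weights="imagenet",        # start from weights pre-trained on ImageNet
        include_top=False,         # drop the original 1000-class classifier
        input_shape=input_shape,
        pooling="avg",
    )
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Tiny made-up batch: 2 grey-scale images of 4 x 5 pixels.
gray = np.arange(2 * 4 * 5, dtype=np.float32).reshape(2, 4, 5)
rgb_like = to_three_channels(gray)
print(rgb_like.shape)
```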

(49) (a) training data set. (b) validation data set.
Figure 4.28: evolution of loss on ResNet50 using transfer learning from a model trained on the ImageNet data set. The model is trained during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes.

0: BOTH, 1: LEFT, 2: NONE, 3: RIGHT
Figure 4.29: confusion matrix of ResNet50 using transfer learning from a model trained on the ImageNet data set. The model is trained during 52 epochs using an Adam optimizer and a 1e-5 learning rate to classify 4 classes.

For this classification task, using transfer learning does improve the performance of the ResNet50 model. The model is trained up to an accuracy of around 96% (Figure 4.28a), against less than 88% without transfer learning, and the loss drops below 0.1 during the training process (Figure 4.28b), whereas the original ResNet50 model only lowered the loss to 0.25. However, the validation loss (in orange in Figure 4.28b) does not decrease, which may indicate an over-fitting problem. Despite this observation, the model gets an overall accuracy of 86% on the test set, and the confusion matrix in Figure 4.29 shows that it assigns images to all four classes. The accuracy of the NONE and LEFT classes, which was already good in the previous model, stays above 85% in this model. Even though the LEFT class has lost 7% of accuracy, only 27% of all images were classified as LEFT, against almost 40% in the previous model. With transfer learning the model is therefore more precise.
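The distinction made above between the accuracy of a class and the share of images assigned to it can be made concrete by reading both quantities off a confusion matrix. The sketch below is only an illustration: the counts are hypothetical and are not the values of Figures 4.27 or 4.29, and the function name `per_class_metrics` is an assumption.

```python
def per_class_metrics(confusion):
    """Per-class recall, precision and predicted share.

    confusion[i][j] is the number of images of true class i
    that the model assigned to class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    metrics = []
    for k in range(n_classes):
        true_k = sum(confusion[k])                             # images whose true class is k
        pred_k = sum(confusion[i][k] for i in range(n_classes))  # images predicted as k
        correct = confusion[k][k]
        metrics.append({
            "recall": correct / true_k if true_k else 0.0,
            "precision": correct / pred_k if pred_k else 0.0,
            "predicted_share": pred_k / total if total else 0.0,
        })
    return metrics


labels = ["BOTH", "LEFT", "NONE", "RIGHT"]

# Hypothetical counts for 100 test images (NOT the thesis results):
# rows are true classes, columns are predicted classes.
confusion = [
    [6, 3, 1, 0],    # BOTH
    [1, 18, 1, 0],   # LEFT
    [0, 2, 58, 0],   # NONE
    [0, 7, 3, 0],    # RIGHT
]

metrics = per_class_metrics(confusion)
for name, m in zip(labels, metrics):
    print(f"{name:5s} recall={m['recall']:.2f} "
          f"precision={m['precision']:.2f} "
          f"predicted_share={m['predicted_share']:.2f}")
```

In this made-up example LEFT has a recall of 0.90 but a precision of only 0.60, because 30% of all images are assigned to LEFT while only 20% truly belong to it. A model that predicts a dominant class too often can therefore show a high per-class accuracy and still be imprecise, which is the effect that transfer learning reduces in the comparison above.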

(50)
model          transfer learning   optimizer   learning rate   number of epochs   test accuracy
5 layers CNN   no                  RMSprop     1e-4            50                 0.86
5 layers CNN   no                  Adam        1e-4            100                0.95
VGG16          no                  Adam        1e-4            50                 0.90
ResNet50       no                  Adam        1e-5            52                 0.77
ResNet50       yes                 Adam        1e-5            50                 0.86
Table 4.2: Comparison of different models on 4 classes classification.

To conclude on this classification task, the basic 5-layer CNN using an Adam optimizer gives the best result, underlining once again that a very deep network is not the best solution for this particular application of image classification. We have also seen the benefit of transfer learning on a deep CNN such as ResNet50: the performance of the model improves markedly when training starts from weights trained on the ImageNet data set. After the same number of epochs and under the same conditions, the network gains almost 10 percentage points of accuracy on the test batch (Table 4.2).

(51) Chapter 5. Conclusion and future lines

5.1. Conclusion on the results

The previous experiments show that, in our case, the simplest networks give better results and that adding more layers does not mean getting more accuracy in the classification. In particular, the best results were obtained with a CNN composed of only 5 convolutional layers. Experimenting with different optimizer settings showed that the Adam optimizer seems best suited to our problem, giving better results for both the binary classification and the four-class classification.

We can wonder why such a simple model performs better than deeper networks, which are known for their efficiency in image classification. In addition to the discussion on Batch Normalization, part of the explanation lies in the construction of the data set. There are obvious problems of mislabeling, due to the subjective boundary between images with and without coronary arteries and to the tedious nature of such a classification, which inevitably leads to classification mistakes. Beyond these, the main criticism that can be made of this data set is the uniformity of its images. Even though we have images of different quality, they were all acquired under the same conditions, with the same device and at the same scale. All images of the data set are therefore highly correlated, and the good results obtained are most certainly partially due to over-fitting. The first step toward a more honest data set would be to perform data augmentation so that coronary arteries are represented at different scales. Of course, adding images from other DICOM files, acquired with other devices and with different orientations and classes, would be the best solution. Another big flaw of the data set is the highly unbalanced distribution of data among classes; adding images to the least represented classes would help to obtain a more uniform accuracy.

However, those improvements have a cost. The data set we have used gives a good idea of the results that can be expected from such research and shows that machine learning is worth considering to solve classification problems on medical images. Great results can be expected with a more elaborate data set.
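The scale augmentation suggested above was not implemented in this work. As a rough sketch of what it could look like, the code below rescales a grey-scale image about its centre with nearest-neighbour sampling while keeping the output size unchanged. The function names `zoom_image` and `random_zoom`, the zero fill value and the 0.8 to 1.2 zoom range are all assumptions chosen for the illustration.

```python
import random


def zoom_image(image, scale, fill=0):
    """Rescale a 2D grey-scale image about its centre.

    Nearest-neighbour sampling is used and the output keeps the input size:
    scale > 1 zooms in (the borders are cropped), scale < 1 zooms out
    (the uncovered border is filled with `fill`).
    """
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Map the output pixel back to the nearest source pixel.
            src_y = round(cy + (y - cy) / scale)
            src_x = round(cx + (x - cx) / scale)
            if 0 <= src_y < height and 0 <= src_x < width:
                row.append(image[src_y][src_x])
            else:
                row.append(fill)
        out.append(row)
    return out


def random_zoom(image, low=0.8, high=1.2, rng=random):
    """Apply a zoom factor drawn uniformly from [low, high] (assumed range)."""
    return zoom_image(image, rng.uniform(low, high))


# 4 x 4 test image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

zoomed_in = zoom_image(image, 2.0)    # central 2 x 2 block enlarged to 4 x 4
zoomed_out = zoom_image(image, 0.5)   # image shrunk, border filled with 0
augmented = random_zoom(image, rng=random.Random(0))
```

Applying such a transform with a different random factor at every epoch would expose the network to arteries at several apparent sizes. In practice an image library with proper interpolation would be preferable to this nearest-neighbour version.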

(52) 5.2. Next steps

To go further on this project, the next step would be to design an image segmentation model that identifies the coronary arteries in each image. This task requires building a data set of images in which the arteries have been manually segmented. It is a very long task and we did not have enough time to complete it. Segmentation would give more direct access to the characteristics of the coronary artery, such as its width, which is the crucial information for diagnosing Kawasaki disease.

