
Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid

Segmentation of Mitochondria in Serial Section Electron Microscopy Images of the Brain using Deep Neural Networks

Master's Thesis, Máster Universitario en Inteligencia Artificial

AUTHOR: Enrique Rebollo García
SUPERVISOR: Luis Baumela Molina

2018


ACKNOWLEDGEMENTS

To my professor Luis Baumela, supervisor of this thesis. To my parents. To the town of Portman.


SUMMARY

This master's thesis approaches the problem of image segmentation using Deep Neural Networks. Recently applied to a wide variety of problems, these networks have surpassed the previous state-of-the-art performance in fields such as computer vision, natural language processing and audio analysis. In particular, our work focuses on the segmentation of mitochondria in Serial Section Electron Microscopy images of the brain. Segmentation is a highly relevant task in medical image analysis, as automatic delineation of organs and structures of interest is often necessary to perform computer-assisted diagnosis. All the experiments make use of the public Electron Microscopy Dataset, a representation of a section taken from the CA1 hippocampus region of the brain. To this end, a new encoder-decoder architecture is proposed, on which the performance of various loss functions is studied. This network is essentially a simplified version of other architectures previously presented in the literature, and it achieves results close to the state of the art on the same dataset used in this study. The results provide evidence that loss functions which account for the class imbalance problem perform much better than those which ignore the class distribution. This document also surveys the current state of the art in deep learning architectures and optimization techniques for image segmentation.


Contents

1. INTRODUCTION
   1.1. Contextual Framework
   1.2. Motivation
   1.3. Structure
2. FUNDAMENTALS OF IMAGE SEGMENTATION
   2.1. Region-based Segmentation
      2.1.1. Threshold
      2.1.2. Regional Growth
   2.2. Edge Detection Segmentation
      2.2.1. Sobel Operator
      2.2.2. Laplacian Operator
   2.3. Active Contours
      2.3.1. Snakes
      2.3.2. Intelligent Scissors
      2.3.3. Level Sets
   2.4. Clustering Segmentation
      2.4.1. Watershed
      2.4.2. K-means
      2.4.3. Mixtures of Gaussians
      2.4.4. Mean-Shift
      2.4.5. Normalized Graph Cuts
   2.5. Random Fields
      2.5.1. Markov Random Fields
      2.5.2. Conditional Random Fields
   2.6. Support Vector Machines
   2.7. Metrics for Semantic Segmentation
      2.7.1. Pair-counting metrics
      2.7.2. Information Theory metrics
      2.7.3. Overlap-based metrics
      2.7.4. Volume-based metrics
3. DEEP LEARNING APPROACHES TO IMAGE SEGMENTATION
   3.1. Transposed Convolution
   3.2. Encoder-Decoder architectures
   3.3. Unpooling
   3.4. Strided Convolution
   3.5. Dilated Convolution
   3.6. Large Kernel Size
   3.7. Residual Networks
   3.8. Densely Connected Convolutional Networks
   3.9. FCN joint with traditional methods
      3.9.1. Conditional Random Field
      3.9.2. Compressed Sensing
   3.10. Pyramid Representation
      3.10.1. Pyramid Pooling
      3.10.2. Atrous Spatial Pyramid Pooling
      3.10.3. Volumetric Networks
   3.11. Loss Function
      3.11.1. Cross-Entropy
      3.11.2. Intersection-over-Union
      3.11.3. Lovász-Hinge
      3.11.4. Dice Loss
4. EXPERIMENTS
   4.1. Motivation
   4.2. Dataset
   4.3. Previous work on the EM dataset
   4.4. Proposed architecture
   4.5. Training
      4.5.1. Data Augmentation
      4.5.2. Minibatch structure
      4.5.3. Loss
      4.5.4. Regularization
      4.5.5. Optimizer
      4.5.6. Learning Rate
      4.5.7. Hardware
      4.5.8. Software
   4.6. Results
      4.6.1. Cross-Entropy
      4.6.2. Intersection over Union
      4.6.3. Lovász-Hinge
      4.6.4. Dice coefficient
   4.7. Conclusion
5. FUTURE WORK

List of Figures

1. Output of different segmentation models.
2. Image thresholded using OTSU algorithm.
3. Sobel operator segmentation.
4. Laplacian operator segmentation.
5. Confusion Matrix.
6. AlexNet CNN architecture.
7. VGG-16 architecture.
8. Inception module with dimension reductions.
9. Deconvolution operation.
10. Convolution layers enable a classification network to output a heatmap.
11. U-net architecture (example for 32x32 pixels in the lowest resolution).
12. SegNet architecture is fully convolutional.
13. Unpooling operation.
14. Architecture of the proposed deconvolutional network.
15. SegNet decoder upsampling compared to FCN.
16. Convolution operation with filter size 3 and stride 2.
17. Systematic dilation.
18. GCN addresses both the classification and localization issues.
19. Details of GCN module.
20. Residual learning building block.
21. Building blocks for different residual networks.
22. 5-layer dense block with a growth rate of k = 4.
23. Overview of proposed PSPNet.
24. Auxiliary loss in ResNet101.
25. ASPP employs multiple parallel filters with different rates.
26. Convolution operation in 3D.
27. Proposed 3D CNN with residual blocks.
28. Lovász hinge in the case of two-pixel prediction.
29. Proposed encoder-decoder architecture.
30. Contracting path of proposed architecture.
31. Expansive path of proposed architecture.
32. Training example annotated for mitochondria segmentation.
33. Data augmentation on one single training example.
34. Data augmentation on one single training example (ground truth).
35. Google Cloud Platform logo.
36. Python logo.
37. TensorFlow logo.
38. Intel AVX and SSE data types.
39. Best outcome - testing slice 164.
40. Worst outcome - testing slice 34.
41. Best outcome - testing slice 163.
42. Worst outcome - testing slice 34.
43. Best outcome - testing slice 163.
44. Worst outcome - testing slice 20.
45. Best outcome - testing slice 163.
46. Worst outcome - testing slice 20.
47. Outcome for testing slice 34.
48. Top-1 accuracy vs. operations / number of parameters.


List of Tables

1. Results of previous works on the EM dataset.
2. Transformations applied on every training example.
3. Augmented dataset structure.
4. Different minibatch structures.
5. Cross-Entropy loss best and worst results on testing set.
6. IoU loss best and worst results on testing set.
7. Lovász loss best and worst results on testing set.
8. Dice loss best and worst results on testing set.
9. Jaccard index achieved with each loss function on the testing set.

1. INTRODUCTION

This master's thesis studies the use of Deep Neural Networks for semantic segmentation tasks. In particular, it focuses on the segmentation of mitochondria in images of the brain obtained by Serial Section Electron Microscopy. Mitochondrial abnormalities are involved in aging and disease, including neurodegeneration, cancer, and metabolic disorders. To this end, a new encoder-decoder architecture is proposed, on which the performance of various loss functions is studied. The results provide evidence that loss functions which account for the class imbalance problem perform much better than those which ignore the class distribution.

Segmentation can be defined as the process of decomposing an image into regions which are homogeneous according to some criteria. Internal properties of a region help to identify it, while its external properties (inclusion, adjacency, ...) are used to group regions. This is a highly relevant task in medical image analysis, as automatic delineation of organs and structures of interest is often necessary to perform computer-assisted diagnosis. Beyond its medical applications, semantic segmentation is a key step towards complete scene understanding.

Deep Neural Networks have recently been applied to a wide variety of problems, surpassing the previous state-of-the-art performance in fields such as computer vision, natural language processing and audio analysis. These models did not receive much attention until recent years, mainly because it was not possible to train them successfully using the standard back-propagation algorithms employed for shallow networks.

Convolutional Neural Networks (CNNs or ConvNets) are a class of deep feedforward artificial neural networks based on the convolution operator. Neurons in a CNN present a connectivity pattern that resembles the organization of the animal visual cortex, the region of the brain responsible for the actual interpretation of what we see. ConvNets constitute the current state of the art for semantic segmentation, as well as for other computer vision tasks such as object detection, instance segmentation and face recognition. In this thesis, ConvNets are the basic building block of the proposed neural architecture.

1.1. Contextual Framework

The human brain is an amazing and powerful tool. It allows us to perceive, remember, understand, learn and communicate, while requiring only 20 watts of power to operate. Humankind has always been fascinated by it, and understanding how it works remains one of the great challenges of the new century.

Recently, the term Brain Sciences has been coined to denote all disciplines that contribute to our understanding of the human brain, including neuroscience, cognitive science, brain imaging, psychology, and other neural and behavioral sciences. Researchers in these disciplines are recording the activity and mapping the connectivity of neuronal structures, trying to understand how the brain works.

The brain sciences have always been closely related to computer science. Understanding how biological systems process sensory signals may provide important clues for building artificial systems that will operate in the real world. In Computing Machinery and Intelligence [1], an article published in 1950 in Mind, Alan Turing argued that computing devices could ultimately emulate intelligence, leading to his proposed Turing test. In The Computer and the Brain [2], published in 1958 after his death, John von Neumann discussed how the brain can be viewed as a computing machine and identified differences between the brain and the computers of his day. In his book Vision [3], from 1982, David Marr proposed that visual function could be studied at the algorithmic level regardless of the underlying physical hardware.

Nowadays, the bonds between these two fields are becoming even stronger, providing new possibilities for brain scientists to connect behavior, function and structure in ways that were heretofore impossible. Data related to brain research have exploded. New imaging modalities, such as DWI¹, EM² and CLEM³, are dramatically increasing the resolution, scale, and volume of brain imaging data. Traditional technologies, like MEG⁴ and EEG⁵, are also used to record synchronized neural activity at a very high temporal resolution, which is key to understanding dynamic cognitive processes.

The structure of the nervous system is extraordinarily complicated because individual neurons are interconnected with hundreds or even thousands of other cells. Connectomics is the production and study of comprehensive maps of such networks at the level of synaptic connections; EM Connectomics and MR Connectomics are some of the techniques involved in this goal. The human connectome can be viewed as a graph to which algorithms from graph theory can be applied. The Human Connectome Project (HCP), a 5-year scientific project sponsored by the NIH⁶, is intended to build a connectome that will lead to a better understanding of the anatomical and functional connectivity within the healthy human brain.

¹ Diffusion-Weighted Magnetic Resonance Imaging is the use of specific MRI sequences to map the diffusion process of molecules in biological tissues.
² The Electron Microscope uses a beam of accelerated electrons as a source of illumination.
³ Correlative Light-Electron Microscopy is the combination of an optical microscope with an electron one.
⁴ Magnetoencephalography is the recording of the brain's magnetic fields.
⁵ Electroencephalography is the recording of the brain's electrical fields.
⁶ The National Institutes of Health is the primary agency of the US government responsible for biomedical and public health research.

New models for machine learning inspired by neural architectures are reigniting interest in biomimetic algorithms. However, the connections between these algorithms and the operating principles of the brain remain unclear. Once understood, the computational efficiency of the brain may inspire new computer architectures. Neuromorphic computers mimic neuro-biological architectures with VLSI⁷ analog circuits, in contrast to the von Neumann architecture. The Human Brain Project (HBP), a 10-year scientific research project coordinated by the École Polytechnique Fédérale de Lausanne and largely funded by the European Union, has performed extensive research to identify which computer architecture is best suited to study whole-brain networks efficiently. HBP is intended to build a collaborative scientific infrastructure of large-scale neuromorphic machines based on exascale⁸ computers, to allow researchers across Europe to advance knowledge in cognitive neuroscience and brain-inspired computing.

Biological neurons, like those found in the brain, communicate primarily by emitting "spikes" of pure electro-chemical energy. Spiking Neural Networks (SNNs) more closely mimic natural neural networks, incorporating the concept of time into their operating model (in addition to neuronal and synaptic state). TrueNorth is a neurosynaptic processor composed of 4096 cores, each one having 256 programmable simulated neurons, adding up to a total of over a million neurons. Based on this chip, IBM presented at CVPR⁹ 2017 an SNN capable of recognizing hand gestures in real time from images captured by an event camera¹⁰, all with a power consumption under 200 mW. Loihi is a neuromorphic research test chip designed by Intel Labs and formally presented at NICE¹¹ 2018. It features a unique programmable microcode learning engine for on-chip SNN training. SpiNNaker, popularly known as the "human brain supercomputer", has been developed with the aim of enabling large-scale neural network simulations in real time with low power consumption, and had been turned on by the time this document was disclosed. SpiNNaker mimics our brain's massively parallel communication network, sending billions of small amounts of data simultaneously to several thousand different destinations.

Simultaneously, new methods for acquiring and processing behavior data are emerging from the mobile device revolution. These unprecedented quantities of digital information, arriving at unprecedented rates, require algorithmic and architectural breakthroughs. Cloud-based computing provides affordable access to huge computational power on a pay-as-you-go basis. Among other advantages, these platforms do not require any initial investment and are much more reliable and consistent than in-house infrastructure.

⁷ Very-large-scale integration.
⁸ Computing systems capable of at least 1 exaflop per second.
⁹ Conference on Computer Vision and Pattern Recognition - http://cvpr2018.thecvf.com/
¹⁰ Event cameras transmit a data packet whenever a pixel detects a change.
¹¹ Neuro Inspired Computational Elements Workshop - http://niceworkshop.org/

AI accelerators, custom microchips designed to accelerate deep learning (only for inference so far), are already in their second generation. Google TPUv2¹² and Intel NCS2¹³ inference latencies dramatically outperform those achieved with the latest GPUs¹⁴.

1.2. Motivation

As stated previously, new medical imaging techniques generate tremendous amounts of measurement data, making their processing almost impossible for a human. While, at one end of the scale, scientists can observe neuroanatomy at nanometer resolution, at the other end the whole brain's functional behavior can be monitored over extended periods and under a variety of stimuli. The challenge is to relate the many different scales and modalities of data in ways that will support new kinds of scientific collaboration. As a result, computational analysis and modeling have become necessary to extract relevant information from these huge volumes of data.

Mitochondria are subcellular organelles found in the cells of every complex organism, responsible for crucial tasks like energy production. Evidence suggests a relation between mitochondrial abnormalities and degenerative disorders related to aging, such as Alzheimer's and Parkinson's diseases. These studies have raised the need for detailed, high-resolution analysis of physical alterations in mitochondria.

The Electron Microscope is a type of microscope that uses a beam of electrons to create an image of the specimen. It is capable of much higher magnification and has a greater resolution than a light microscope, allowing it to image much smaller objects in finer detail. EM is key to mapping the morphology of neural structures, and the latest EM imaging techniques permit the automatic acquisition of a large number of serial sections from brain samples.

The sheer size and complexity of a typical EM image stack render many segmentation schemes intractable. Besides, due to the variety of mitochondrial structures, automated segmentation and reconstruction in EM images is a challenging task. Various research works have recently addressed this problem. In [4], the authors propose an automated graph partitioning scheme that addresses the issue of cluttered membranes belonging to numerous objects. In [5], the fact that mitochondria have thick dark membranes is exploited by an active surface-based method to refine the boundary surfaces. In [6], anisotropy-aware regularization is used via conditional random field inference.

¹² Tensor Processing Unit v2.
¹³ Neural Compute Stick 2.
¹⁴ Graphics Processing Units.

In [7], the problem is tackled by a non-parametric higher-order model that uses a patch-based representation of its potentials. In [8], an approximate subgradient descent algorithm is used to minimize the margin-sensitive hinge loss in SSVM¹⁵ frameworks. In [9], a 3D fully residual convolutional network with a deeply supervised strategy is proposed.

1.3. Structure

This master's thesis is structured as follows:

1. The first chapter embodies this introduction.

2. The second chapter is an introduction to image segmentation problems and the different metrics employed to measure the performance of these models. The reader will find a summary of the classical approaches employed before deep learning took over the computer vision field.

3. The third chapter presents the state of the art in deep learning applied to classification and segmentation problems. The reader will find a description of the latest architectures and techniques developed for these tasks.

4. The fourth chapter presents the experiments conducted on our own implementation of an encoder-decoder network, on which the performance of different loss functions is evaluated. The reader will find the motivation for this work, a summary of every step taken and a description of the results achieved.

5. The fifth and final chapter summarizes lines of further research and potential improvements to these results.

¹⁵ A Structured Support Vector Machine generalizes the SVM classifier, enabling it to output structured labels.


2. FUNDAMENTALS OF IMAGE SEGMENTATION

Image segmentation, a hotspot in the computer vision field, is the process of labeling specific regions of an image according to what is being shown. A digital image is a representation of two-dimensional information, and extracting this knowledge to accomplish some other task is an important area of application. Furthermore, segmentation is the first step to take when trying to understand a scene.

In particular, this thesis is aimed at semantic segmentation models, whose goal is to decompose an image into regions which are homogeneous according to some criteria and have some semantic interpretation, all while ensuring spatial consistency. The output of these models is a single segmentation map, ideally with the same spatial dimensions as the input image. As this means predicting a class for each image pixel, semantic segmentation is sometimes referred to as dense prediction. It is important to note that this approach only distinguishes between different classes and not between different instances of the same class; semantic segmentation maps do not inherently distinguish these as separate objects. Instance segmentation models, on the other hand, do distinguish between separate objects of the same class. The output of these models is a collection of local segmentation masks describing each object detected in the image. The current state of the art in image instance segmentation is Mask R-CNN [10], by Facebook AI Research, but that is beyond the scope of this thesis.

Fig. 1: Output of different segmentation models: (a) semantic, (b) instance.

Image segmentation technology is widely applied in tasks like medical diagnosis, face recognition and pedestrian detection.

2.1. Region-based Segmentation

Region-based segmentation methods operate iteratively by grouping together pixels which are neighbours and have similar values, and splitting groups of pixels which are dissimilar in value. This section presents different region-based segmentation algorithms.

2.1.1. Threshold

Thresholding, the simplest method used for image segmentation, divides a gray-scale image based on the gray values of different targets. Thresholding methods include global and local algorithms. Global threshold methods divide an image into two regions, target and background, by a single threshold. Local threshold methods divide the image into multiple regions, targets and backgrounds, by multiple thresholds.

The most common threshold segmentation algorithm is OTSU [11], named after its creator, Nobuyuki Otsu. It calculates the optimum threshold separating the two classes so that their combined intra-class variance is minimal and their inter-class variance is maximal. Several other approaches have been proposed to calculate this threshold: entropy-based, minimum error, co-occurrence matrix, moment preserving, etc.

Fig. 2: Image thresholded using OTSU algorithm: (a) original, (b) segmentation.

When the target and the background have high contrast, thresholding methods obtain accurate results with no significant computational cost. Conversely, these algorithms suffer when there is no significant gray-scale difference in the image, or when the gray-scale values of the classes largely overlap.
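As a concrete illustration of global thresholding, the following minimal sketch applies Otsu's method with scikit-image (this is not code from the thesis; the image path is hypothetical):

```python
import numpy as np
from skimage import io, filters

# Load a grayscale image (hypothetical path).
image = io.imread("em_slice.png", as_gray=True)

# Otsu's method: choose the threshold that minimizes the combined
# intra-class variance (equivalently, maximizes inter-class variance).
threshold = filters.threshold_otsu(image)

# Global thresholding: pixels above the threshold become the target.
binary = image > threshold
print(f"Otsu threshold: {threshold:.4f}, foreground ratio: {binary.mean():.2%}")
```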

2.1.2. Regional Growth

The basic idea of this segmentation algorithm is to group pixels with similar properties together to form a region. The resulting regions satisfy the following conditions:

$$\bigcup_{i=1}^{n} R_i = R, \quad R_i \text{ is a connected region } \forall i = 1 \ldots n, \quad R_i \cap R_j = \emptyset \;\; \forall i \neq j, \quad P(R_i) = \text{True} \;\; \forall i = 1 \ldots n, \quad P(R_i \cup R_j) = \text{False} \;\; \forall i \neq j \tag{1}$$

The first step is the selection of the seed points which constitute the initial region. This selection is based on user needs (e.g., pixels in a certain grayscale range). After this, the algorithm decides whether or not the neighboring pixels of this region should be added to it. Again, different membership criteria can be applied (e.g., pixel intensity, grayscale texture, color). This process is iterated like a clustering algorithm.

2.2. Edge Detection Segmentation

Edges usually appear in the form of discontinuous local features on the boundary between two regions, which can be detected using derivative operations. The edge representation of an image significantly reduces the quantity of data to be processed while retaining essential information regarding the shapes of objects in the scene. Points, lines and edges constitute different types of discontinuities in the gray level that can be detected by simply convolving the image with a template. The most commonly used first-order differential operators are Prewitt, Roberts and Sobel. Second-order differential operators include non-linear operators such as Laplacian, Kirsch and Wallis.

2.2.1. Sobel Operator

The Sobel operator is a first-order differential operator for edge detection. It approximates the gradient of the image luminance function, which is not continuous. This technique computes a 2-D spatial gradient on an image and thus highlights regions of high spatial frequency that correspond to edges.

The operator consists of two 3 × 3 kernels, a transverse and a longitudinal template, which are convolved with the image to obtain the horizontal and vertical differences:

$$G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \tag{2a}$$

$$G_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}. \tag{2b}$$

By combining the horizontal and vertical gradient approximations, the magnitude (strength) of the edge can be expressed as

$$G = \sqrt{G_x^2 + G_y^2}, \tag{3}$$

and the orientation is given by

$$\Theta = \arctan\left(\frac{G_y}{G_x}\right). \tag{4}$$

Fig. 3: Sobel operator segmentation: (a) original, (b) outcome.

2.2.2. Laplacian Operator

The Laplacian operator is a second-order differential operator for edge detection. It is the simplest isotropic differential operator, with rotational invariance. The Laplacian of an image f(x, y) is defined as

$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}. \tag{5}$$

The Laplacian edge detector calculates second-order derivatives in a single pass, using only one kernel:

$$\nabla^2 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \tag{6}$$

An extended template allows diagonal edges to be detected as well:

$$\nabla^2 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}. \tag{7}$$

This operator is extremely sensitive to noise: by definition, its response to isolated pixels is stronger than its response to edges or lines. In the presence of noise, the Laplacian operator is therefore usually combined with a smoothing operator (e.g., a Gaussian blur) to generate a new template.

Fig. 4: Laplacian operator segmentation: (a) original, (b) outcome.

2.3. Active Contours

Active Contour models are boundary detectors that iteratively move towards a solution guided by a combination of image information and user-defined forces. This section presents three usual approaches to these models, which are known to perform well when locating boundary curves in images.

2.3.1. Snakes

A snake [12] is an energy-minimizing, two-dimensional spline curve that evolves towards specific image features guided by internal and external forces.

• Internal spline forces impose a piecewise smoothness constraint:

$$E_{int} = \frac{\alpha(s)|v_s(s)|^2 + \beta(s)|v_{ss}(s)|^2}{2}. \tag{8}$$

Adjusting α(s) and β(s) controls the relative importance of the first-order and second-order terms. Setting β(s) to zero at a point allows the snake to become second-order discontinuous and develop a corner.

• Image forces push the snake towards image features like lines, edges and subjective contours. They can be written as a weighted combination of three energy functionals:

$$E_{image} = w_{line} E_{line} + w_{edge} E_{edge} + w_{term} E_{term}. \tag{9}$$

1. To detect lines, the image intensity is the simplest energy functional:

$$E_{line} = I(x, y). \tag{10}$$

Depending on the sign of w_line, the snake will be attracted to either light or dark lines.

2. Edges can be found by using the Laplacian operator as the energy functional. The snake will be attracted by contours with large image gradients:

$$E_{edge} = -(G_\sigma \, \nabla^2 I)^2. \tag{11}$$

3. To find corners and terminations of segments, a slightly smoothed version of the image is used. Letting C(x, y) = G_\sigma(x, y) ∗ I(x, y),

$$E_{term} = \frac{\partial \theta}{\partial n_\perp} = \frac{C_{yy} C_x^2 - 2 C_{xy} C_x C_y + C_{xx} C_y^2}{(C_x^2 + C_y^2)^{3/2}}. \tag{12}$$

• External constraint forces push the snake towards a local minimum. The total energy of the snake can therefore be written as the sum of the energies of these forces:

$$E_{snake} = \int_0^l E_{int}(v(s))\,ds + \int_0^l E_{image}(v(s))\,ds + \int_0^l E_{con}(v(s))\,ds. \tag{13}$$

As edges, lines, and subjective contours can all be found by essentially the same mechanisms, the snake model provides a unified treatment for a collection of visual problems that have been treated differently in the past. The standard solution to Equation 13, based on partial differential equations and level sets, requires the use of numerical methods that are computationally costly and may have stability issues. In [13], an efficient and numerically stable approach to contour evolution based on morphological operations is presented.
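A minimal sketch of snake evolution using scikit-image's active_contour (not thesis code; the image path, the initial circle and the α, β, w_edge values are illustrative, and the coordinate convention may differ slightly between scikit-image versions):

```python
import numpy as np
from skimage import io, filters
from skimage.segmentation import active_contour

# Load a grayscale image and smooth it (hypothetical path).
image = filters.gaussian(io.imread("cell.png", as_gray=True), sigma=3)

# Initial contour: a circle around the object of interest.
s = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([100 + 80 * np.sin(s), 100 + 80 * np.cos(s)])

# Evolve the snake: alpha weights first-order smoothness (elasticity),
# beta weights second-order smoothness (rigidity), and w_edge attracts
# the curve towards strong image gradients.
snake = active_contour(image, init, alpha=0.015, beta=10, w_line=0, w_edge=1)
print(snake.shape)  # (200, 2): final contour coordinates
```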

2.3.2. Intelligent Scissors

Intelligent Scissors [14] reformulate boundary definition as a graph search problem where the goal is to find the optimal path between a start node and a set of target nodes. This method allows objects represented in digital images to be accurately extracted using simple gesture motions with a mouse: when the gestured mouse position comes into proximity with an object edge, a live-wire boundary wraps around the object of interest.

To compute this optimal curve path, the image has to be pre-processed to associate lower costs with edges that are likely to be boundaries. Let l(p, q) be the local cost on the directed link from pixel p to a neighboring pixel q; the local cost function is calculated as a weighted sum of three component functionals:

$$l(p, q) = w_Z \cdot f_Z(q) + w_D \cdot f_D(p, q) + w_G \cdot f_G(q). \tag{14}$$

Letting I_L(q) be the Laplacian zero-crossing of an image I at pixel q, a binary feature f_Z(q) is created as:

$$f_Z(q) = \begin{cases} 0 & \text{if } I_L(q) = 0 \\ 1 & \text{if } I_L(q) \neq 0 \end{cases} \tag{15}$$

Letting G be the gradient magnitude, the gradient component function f_G(q) is defined as:

$$f_G = 1 - \frac{G}{\max(G)}. \tag{16}$$

Letting D(p) be the unit vector perpendicular to the gradient direction at point p, the gradient direction feature cost is formulated as:

$$f_D(p, q) = \frac{1}{\pi} \left\{ \cos^{-1}[d_p(p, q)] + \cos^{-1}[d_q(p, q)] \right\}. \tag{17}$$

As the user traces a rough curve, the system continuously recomputes the lowest-cost path between the starting seed point and the current mouse location using Dijkstra's algorithm.

2.3.3. Level Sets

Level-Set Methods (LSM) are a conceptual framework for using level sets as a tool for numerical analysis of surfaces and shapes.

In mathematics, a level set of a real-valued function f(x₁, ..., xₙ) is a set where the function takes on a given constant value c:

$$L_c(f) = \{(x_1, \ldots, x_n) \mid f(x_1, \ldots, x_n) = c\}. \tag{18}$$

A limitation of active contours based on parametric curves (e.g., snakes) is that it is challenging to change the topology of the curve as it evolves. An alternative representation for such closed contours is a level set, where the zero-crossing of a characteristic function φ defines the curve. Instead of evolving the curve f(s), Level-Set methods fit objects of interest by modifying this underlying embedding function.

2.4. Clustering Segmentation

Clustering is the task of grouping a set of objects in such a way that objects in the same group are closer in the feature space to each other than to those in other groups. As it is easy to draw parallels between this and how semantic segmentation classifies pixels in an image, the segmentation of images using clustering algorithms has been an important line of research. This section presents a few clustering algorithms that perform well when applied to dense segmentation tasks.

2.4.1. Watershed

In the computer vision field, a watershed is a transformation defined on a grayscale image viewed as a topographic map. Its name is inspired by the geological watershed, or drainage divide, which separates adjacent drainage basins. Various algorithms can be used to compute watersheds. One of the most popular approaches is Watershed-by-Flooding, which consists in placing a water source in each regional minimum and flooding the entire relief from these sources. The resulting set of barriers where the water sources meet constitutes the edges of the segmentation. Priority-Flood is an improvement to this approach that operates by flooding a DEM¹⁶ inwards from its edges, using a priority queue to determine the next cell to be flooded.

Watershed segmentation generates spatially homogeneous regions which are usually oversegmented due to noise or local irregularities in the gradient. In order to further control the size and number of regions, the gradient magnitude is commonly used instead of the original image (with a preprocessing step to smooth the gradient image).

¹⁶ A Digital Elevation Model is a representation of terrain elevations above some common base level, usually stored as a rectangular array of floating-point or integer values.
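The following hedged sketch shows the usual marker-based watershed recipe with scipy and scikit-image: a distance transform provides the "regional minima" from which the relief is flooded (not thesis code; the input path and min_distance value are assumptions):

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import io, filters
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

# Binary foreground mask via Otsu thresholding (hypothetical input path).
image = io.imread("cells.png", as_gray=True)
mask = image > filters.threshold_otsu(image)

# Distance transform: its peaks act as water sources (markers).
distance = ndi.distance_transform_edt(mask)
coords = peak_local_max(distance, min_distance=10, labels=mask)
markers = np.zeros(distance.shape, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

# Flood the negated distance map from the markers.
labels = watershed(-distance, markers, mask=mask)
print(f"{labels.max()} regions found")
```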

2.4.2. K-means

Originally developed for signal processing, K-means is the simplest and most used clustering technique. It aims to partition all observations into k clusters, each observation belonging to the cluster with the closest mean. The algorithm is given the number of clusters k it is supposed to find and the initial centroids for each of them. K-means iteratively updates these cluster seed locations based on the samples that are closest to each centroid.

• Initialization: due to the gradient-descent-like optimization, this approach is highly sensitive to the initial placement of the cluster centers. Commonly used initialization methods are Forgy and Random Partition.
  – The Forgy method randomly chooses k observations from the dataset as the centroids of the clusters.
  – The Random Partition method first randomly assigns a cluster to each observation and then calculates the centroid of each cluster.

• Assignment: each observation is assigned to the cluster whose centroid is closest. While different metrics can be applied, usually the squared Euclidean distance is used. Mathematically, the result is the Voronoi diagram generated by choosing the centroids as seeds:

$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\| \leq \left\| x_p - m_j^{(t)} \right\| \;\; \forall j,\; 1 \leq j \leq k \right\}. \tag{19}$$

• Update: the new means are calculated as the centroids of the new clusters:

$$m_i^{(t+1)} = \frac{1}{\left|S_i^{(t)}\right|} \sum_{x_j \in S_i^{(t)}} x_j. \tag{20}$$

K-means implicitly models the probability density as a superposition of symmetric distributions, even though no probabilistic reasoning is performed.

2.4.3. Mixtures of Gaussians

An image can be seen as a matrix of numbers where each value represents the intensity or color of a point. Let X be a random variable that takes these values. We can suppose the probability density function of X to be a mixture of Gaussian distributions of the following form:

$$f(x) = \sum_{i=1}^{k} w_i \, N(x \mid \mu_i, \sigma_i^2), \tag{21}$$

$$N(x \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \, e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}, \tag{22}$$

where k is the number of classes in the image, μ_i and Σ_i are the mean and covariance matrix for class i, and w_i > 0 are weights such that $\sum_{i=1}^{k} w_i = 1$.

To calculate a maximum likelihood estimate of the unknown mixture parameters w_i, μ_i, σ_i, the Expectation-Maximization (EM) algorithm is used. The EM algorithm is an iterative method to find maximum likelihood estimates of parameters in statistical models.

1. Initialization: the mean μ_k, covariance matrix Σ_k and weight w_k are given initial values for each cluster.

2. Expectation step: a function is created for the expectation of the log-likelihood, evaluated using the current estimate of the parameters:

$$z_{ik} = \frac{1}{Z_i} \, w_k \, N(x_i \mid \mu_k, \Sigma_k). \tag{23}$$

3. Maximization step: the parameters are recomputed, maximizing the expected log-likelihood found in the previous step:

$$\mu_k = \frac{1}{N_k} \sum_i z_{ik} \, x_i, \tag{24}$$

$$\Sigma_k = \frac{1}{N_k} \sum_i z_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^T, \tag{25}$$

$$w_k = \frac{N_k}{N}. \tag{26}$$

2.4.4. Mean-Shift

Mean-shift is a non-parametric algorithm for cluster analysis of complex multimodal feature spaces. The key to this algorithm is a technique for efficiently finding peaks in a high-dimensional data distribution without computing the complete density function explicitly. Given a sparse set of samples, the simplest approach to estimating the density function is to smooth the data by convolving it with a kernel k(r) of width h:

$$f(x) = \sum_i K(x - x_i) = \sum_i k\left(\frac{|x - x_i|^2}{h^2}\right). \tag{27}$$

The derivative of this density function is given by:

$$\nabla f(x) = \sum_i (x_i - x) \, G(x - x_i) = \sum_i (x_i - x) \, g\left(\frac{|x - x_i|^2}{h^2}\right), \tag{28}$$

where g(r) is the first derivative of k(r). Finding local maxima of this function using gradient ascent or any other optimization technique may not be computationally affordable for high-dimensional search spaces. To overcome this issue, Mean-Shift employs Multiple Restart Gradient Descent as its optimization algorithm. The gradient of f(x) can be expressed as:

$$\nabla f(x) = \left[ \sum_i G(x - x_i) \right] m(x), \tag{29}$$

where the mean-shift vector m(x) represents the difference between the weighted mean of the neighbors x_i around x and the current value of x:

$$m(x) = \frac{\sum_i x_i \, G(x - x_i)}{\sum_i G(x - x_i)} - x, \tag{30}$$

$$y_{k+1} = y_k + m(y_k) = \frac{\sum_i x_i \, G(y_k - x_i)}{\sum_i G(y_k - x_i)}. \tag{31}$$

This method applied to segmentation problems was first introduced in [15]. The paper proves the convergence of this algorithm to a local maximum under reasonable conditions on the kernel k(r), such as:

$$k_E(r) = \max(0, 1 - r). \tag{32}$$

2.4.5. Normalized Graph Cuts

Normalized Graph Cuts [16] is a kind of spectral clustering developed particularly for image segmentation. In this approach, the pixels of the image form the nodes of a graph whose weighted edges represent similarity between pixels, and the algorithm cuts the graph into two subgraphs. Let G = (V, E) be a graph whose nodes are image pixels and whose edges have a weight w(i, j) representing the similarity between nodes i and j. G can be partitioned into two disjoint graphs with node sets A and B by removing any edges that connect nodes in A with nodes in B.

In graph theory, the cut of this partition represents the degree of dissimilarity between the two sets A and B:

$$cut(A, B) = \sum_{u \in A, v \in B} w(u, v). \tag{33}$$

The optimal bi-partitioning of a graph is the one that minimizes this cut value, so that the similarity within the sets is high and the similarity across different sets is low. However, this criterion favors cutting small sets of isolated nodes, which is not helpful when finding large uniform colors or textures. The Normalized Cut is another disassociation measure that avoids this bias towards partitioning out small sets of points by computing the cut cost as a fraction of the total edge connections in the graph:

$$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}, \tag{34}$$

where assoc(A, V) is the total connection from nodes in A to all nodes in the graph, and assoc(B, V) is similarly defined:

$$assoc(A, V) = \sum_{u \in A, t \in V} w(u, t). \tag{35}$$

Given an image (or an image sequence), the grouping algorithm performs the following steps:

1. Set up a weighted graph G = (V, E) with nodes representing pixels and weights measuring disassociation between nodes.

2. Let d be a vector whose elements d_i represent the total connection from node i to all other nodes:

$$d_i = \sum_j w(i, j). \tag{36}$$

3. Let x be a vector whose elements x_i are equal to 1 if node i is in A and -1 otherwise, and let y be a continuous approximation to the vector x:

$$y = (1 + x) - \frac{\sum_{x_i > 0} d_i}{\sum_{x_i < 0} d_i} \, (1 - x). \tag{37}$$

4. Find the eigenvalues λ by solving the generalized eigenvalue system:

$$(D - W) y = \lambda D y. \tag{38}$$

5. Use the eigenvector with the second smallest eigenvalue to bipartition the graph.

6. Decide whether the current partition should be subdivided, in a recursive manner.
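To make the clustering view of segmentation concrete, here is a minimal sketch (not from the thesis) that applies K-means (Section 2.4.2) to per-pixel color features with scikit-learn; substituting sklearn.mixture.GaussianMixture would give the Mixtures-of-Gaussians variant of Section 2.4.3. The image path and k are illustrative:

```python
import numpy as np
from skimage import io
from sklearn.cluster import KMeans

# Load an RGB image and flatten it to a (num_pixels, 3) feature matrix
# (hypothetical path; any color image works).
image = io.imread("scene.png")[:, :, :3]
pixels = image.reshape(-1, 3).astype(np.float64)

# Cluster pixel colors into k groups; each cluster becomes a segment.
k = 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Reshape the per-pixel labels back into a segmentation map.
segmentation = kmeans.labels_.reshape(image.shape[:2])
print(segmentation.shape, np.unique(segmentation))
```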

2.5. Random Fields

In mathematics, a Random Field is a generalization of a classic stochastic process such that the underlying parameter need no longer be a simple value (continuous or discrete), but can instead take multidimensional values.

2.5.1. Markov Random Fields

A Markov Random Field (MRF) is a set of random variables having a Markov property described by an undirected graph. Given an undirected graph G = (V, E), a set of random variables X = (X_v)_{v∈V} indexed by V is said to form an MRF with respect to G if it satisfies the Markov properties:

• Pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables.
• Local Markov property: a variable is conditionally independent of all other variables given its neighbors.
• Global Markov property: any two subsets of variables are conditionally independent given a separating subset.

According to the Hammersley-Clifford theorem, a random field is an MRF if and only if P(X) follows a Gibbs distribution. In that case it can be expressed as a factorization over the complete subgraphs of the graph, called cliques. A clique is a graph subset S ⊆ G where every pair of pixels in the subset are neighbors. A value V_c(x) is assigned to every clique c in the image:

$$U(x) = \sum_{c \in G} V_c(x). \tag{39}$$

The energy function of an MRF model can thus be expanded over cliques of increasing order:

$$U(x) = \sum_{c \in G} V_c(x) = \sum_{i \in c_1} V_{c_1}(x_i) + \sum_{(i,j) \in c_2} V_{c_2}(x_i, x_j) + \ldots \tag{40}$$

In the binary case, pixel classes are represented by Gaussian distributions:

$$P(y_s \mid x_s) = \frac{1}{\sqrt{2\pi}\,\sigma_{x_s}} \exp\left(-\frac{(y_s - \mu_{x_s})^2}{2\sigma_{x_s}^2}\right), \tag{41}$$

and the corresponding energy function is

$$U(x) = \sum_s \left[ \log\left(\sqrt{2\pi}\,\sigma_{x_s}\right) + \frac{(y_s - \mu_{x_s})^2}{2\sigma_{x_s}^2} \right] + \sum_{s,r} \beta \, \delta(x_s, x_r). \tag{42}$$

This way, the segmentation problem is reduced to the minimization of a non-convex energy function (usually using gradient descent algorithms):

$$\hat{x} = \operatorname*{argmin}_{x \in \Omega} U(x). \tag{43}$$

MRF models are able to capture contextual constraints of real images, where neighboring pixels usually have similar properties like intensity or color. However, MRFs suffer from two key limitations:

• Due to the complexity of inference and parameter estimation, generally only local relationships between neighboring nodes are incorporated into the model, making it highly inefficient at capturing long-range interactions.
• Because of their generative¹⁷ nature, many labeled images are required to estimate the parameters of these models. Even when the posterior probability is simple, the true underlying generative model may be quite complex.

2.5.2. Conditional Random Fields

Conditional Random Fields [17] are discriminative¹⁸ classifiers used to segment and label sequence data. They are a variant of MRFs in which each random variable may also be conditioned upon a set of global observations o. Let G = (V, E) be a graph such that Y = (Y_v)_{v∈V} is indexed by the vertices of G. Then (X, Y) is a CRF if the random variables Y_v obey the Markov property with respect to the graph:

$$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v), \tag{44}$$

where w ∼ v means that w and v are neighbors in G. The graph G encodes the conditional distribution

$$P(Y \mid X) = \frac{1}{Z(X)} \hat{P}(X, Y) = \frac{1}{Z(X)} \prod_{i=1}^{m} \phi_i(D_i), \tag{45}$$

with partition function

$$Z(X) = \sum_Y \hat{P}(X, Y) = \sum_Y \prod_{i=1}^{m} \phi_i(D_i). \tag{46}$$

¹⁷ Generative models learn the joint probability distribution p(x, y).
¹⁸ Discriminative models learn the conditional probability distribution p(y|x).

To build the conditional field, each feature function is assigned a set of weights λ, so that, given fixed features f_k and g_k, the joint distribution over the label sequence Y given X has the form:

$$p_\theta(y \mid x) = \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \right). \tag{47}$$

CRFs typically involve a local potential and a pairwise potential. The local potential is usually the output of a pixelwise classifier (usually not smooth), while the pairwise potential favors neighboring pixels without a gradient between them having the same label, making the segmentation smoother. The parameter vector θ = (λ₁, λ₂, ...; μ₁, μ₂, ...) that maximizes the log-likelihood of the training data is estimated by Iterative Scaling¹⁹ algorithms.

2.6. Support Vector Machines

Support Vector Machines [18], also known as support vector networks, are supervised learning models used for classification and regression analysis, well known for their generalization ability. Their objective is to find the optimal hyperplane w · x + b = 0 that separates two classes in a given dataset with maximal margin. Such an optimal hyperplane can be built from a small subset of the training examples, called support vectors, which determine this margin: w can be expressed as a linear combination of the support vectors, $w = \sum_{i \in SV} \alpha_i z_i$. The smaller the number of support vectors relative to the training set size, the higher the network's ability to generalize.

When the two classes are linearly separable, there exist a vector w and a scalar b such that the following inequalities hold for any training example:

$$w \cdot x_i + b \geq 1 \quad \text{if} \quad y_i = 1, \tag{48a}$$

$$w \cdot x_i + b \leq -1 \quad \text{if} \quad y_i = -1. \tag{48b}$$

In this case, the training dataset can be separated by two parallel hyperplanes such that the distance between them, called the margin, is maximized. The maximum-margin hyperplane, the unique one which separates the classes with maximal margin, is the hyperplane that lies halfway between them:

$$y_i (w \cdot x_i + b) \geq 1. \tag{49}$$

¹⁹ Iterative Scaling is an algorithm used to fit log-linear models.

By using the hinge loss, SVMs can extend this approach to the case when separation without error is impossible:

$$\max(0,\, 1 - y_i(w \cdot x_i + b)). \tag{50}$$

When x_i is correctly classified, Equation 49 is satisfied and the function returns zero; otherwise, the function returns a value proportional to the distance from the margin. The function to be minimized is thus:

$$\min_{w,\, b} \left[ \frac{1}{n} \sum_{i=1}^{n} \max(0,\, 1 - y_i(w \cdot x_i + b)) \right] + \lambda \|w\|^2. \tag{51}$$

The parameter λ determines a trade-off between ensuring that x_i is correctly classified (on the right side of the margin) and increasing the margin size for better generalization. Soft-margin models usually perform better even when the training dataset is linearly separable, as hard-margin ones are very sensitive to noise (a single outlier may have a strong impact on the decision boundary).

SVMs also perform non-linear classification efficiently by non-linearly mapping the original input vectors to a higher-dimensional feature space, where a linear decision surface is constructed. This approach is called the kernel trick:

$$\phi : \mathbb{R}^n \rightarrow \mathbb{R}^N. \tag{52}$$
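As a small illustration (not from the thesis), a soft-margin linear SVM on toy data with scikit-learn; its C parameter plays the role of the trade-off controlled by λ in Equation 51 (roughly, C ∝ 1/λ):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-class problem: two Gaussian blobs in a 2-D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Soft-margin linear SVM minimizing hinge loss + L2 regularization;
# C trades margin size against classification errors (C ~ 1/lambda).
clf = LinearSVC(C=1.0, loss="hinge").fit(X, y)
print(clf.coef_, clf.intercept_)  # hyperplane w and b
print(clf.score(X, y))            # training accuracy
```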

2.7. Metrics for Semantic Segmentation

When it comes to evaluating the performance of a binary classifier, predictions are usually classified into four categories. True and False Positives (TP/FP) refer to the number of predicted positives that were correct/incorrect. Similarly, True and False Negatives (TN/FN) refer to the number of predicted negatives. The confusion matrix, also known as the error matrix, is a specific contingency table layout with two dimensions: each column of this matrix represents the instances in a real class, while each row represents the instances in a predicted class.

Fig. 5: Confusion Matrix.

• True Positive Rate, also known as Recall or Sensitivity, is defined as the proportion of real positive cases that were correctly predicted as positive:

$$TPR = \frac{TP}{TP + FN}. \tag{53}$$

• False Positive Rate, also known as Fallout, is defined as the proportion of real negative cases that were incorrectly predicted as positive:

$$FPR = \frac{FP}{FP + TN}. \tag{54}$$

• True Negative Rate, also known as Specificity, is defined as the proportion of real negative cases that were correctly predicted as negative:

$$TNR = \frac{TN}{FP + TN} = 1 - FPR. \tag{55}$$

• False Negative Rate is defined as the proportion of real positive cases that were incorrectly predicted as negative:

$$FNR = \frac{FN}{FN + TP} = 1 - TPR. \tag{56}$$

• Precision or Confidence denotes the proportion of predicted positive cases that were truly positive:

$$Precision = \frac{TP}{TP + FP}. \tag{57}$$

• Accuracy is defined as the ratio of correctly classified examples over all available elements:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}. \tag{58}$$

• Misclassification Rate is calculated as the ratio of wrongly classified examples over all available elements:

$$MR = \frac{FP + FN}{TP + FP + TN + FN} = 1 - Accuracy. \tag{59}$$

The simplest way to evaluate a semantic segmentation is to report the percentage of pixels in the image which were correctly classified. The Pixel Accuracy metric is defined as:

$$PA = \frac{|S_g^1 \cap S_p^1| + |S_g^0 \cap S_p^0|}{|S_g^0| + |S_g^1|} = \frac{TP + TN}{TP + FP + TN + FN}. \tag{60}$$

This metric provides misleading results when the class representation is small within the image, as the measure will be biased towards mainly reporting how well the model identifies the negative cases.

2.7.1. Pair-counting metrics

First, let us define the four basic pair-counting cardinalities:

$$a = \frac{1}{2}\left[TP(TP - 1) + FP(FP - 1) + TN(TN - 1) + FN(FN - 1)\right], \tag{61a}$$

$$b = \frac{1}{2}\left[(TP + FN)^2 + (TN + FP)^2 - (TP^2 + TN^2 + FP^2 + FN^2)\right], \tag{61b}$$

$$c = \frac{1}{2}\left[(TP + FP)^2 + (TN + FN)^2 - (TP^2 + TN^2 + FP^2 + FN^2)\right], \tag{61c}$$

$$d = \frac{n(n - 1)}{2} - (a + b + c). \tag{61d}$$

• The Rand Index between two segmentations is defined as:

$$RI = \frac{a + b}{a + b + c + d}. \tag{62}$$

• The Adjusted Rand Index is a version of the RI corrected for the chance grouping of elements:

$$ARI = \frac{2(ad - bc)}{c^2 + b^2 + 2ad + (a + d)(c + b)}. \tag{63}$$

2.7.2. Information Theory metrics

These metrics are based on the marginal entropy H(S) and the joint entropy H(S₁, S₂) between images, defined as:

$$H(S) = -\sum_i p(s^i) \log\left(p(s^i)\right), \tag{64a}$$

$$H(S_1, S_2) = -\sum_{i,j} p(s_1^i, s_2^j) \log\left(p(s_1^i, s_2^j)\right), \tag{64b}$$

where the S^i are the regions in the image segmentation and the p(S^i) are the probabilities of these regions.

• Mutual Information between two variables is a measure of the amount of information one variable has about the other. MI is based on regions (segments) instead of individual pixels:

$$MI(S_g, S_p) = H(S_g) + H(S_p) - H(S_g, S_p). \tag{65}$$

• Variation of Information measures the amount of information lost (or gained) when changing from one variable to the other:

$$VOI(S_g, S_p) = H(S_g) + H(S_p) - 2\,MI(S_g, S_p). \tag{66}$$

2.7.3. Overlap-based metrics

• The F-score or F-measure is defined as the weighted harmonic mean of the precision and recall of the test:

$$F_\beta = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}. \tag{67}$$

• The Dice coefficient is the specific case of F_β where β = 1:

$$DICE = \frac{2\,|S_g^1 \cap S_p^1|}{|S_g^1| + |S_p^1|} = \frac{2TP}{2TP + FP + FN}. \tag{68}$$

• The Jaccard index is defined as the intersection between two sets divided by their union:

$$IoU = \frac{|S_g^1 \cap S_p^1|}{|S_g^1 \cup S_p^1|} = \frac{TP}{TP + FP + FN}. \tag{69}$$

• Global Consistency Error is defined as the error averaged over all pixels:

$$GCE = \frac{1}{n} \min\left\{ \frac{FN(FN + 2TP)}{TP + FN} + \frac{FP(FP + 2TN)}{TN + FP},\; \frac{FP(FP + 2TP)}{TP + FP} + \frac{FN(FN + 2TN)}{TN + FN} \right\}. \tag{70}$$

2.7.4. Volume-based metrics

• Volumetric Similarity is defined upon the volumetric distance, namely the absolute volume difference divided by the sum of the compared volumes:

$$VS = 1 - VD = 1 - \frac{\left||S_g^1| - |S_p^1|\right|}{|S_g^1| + |S_p^1|} = 1 - \frac{|FN - FP|}{2TP + FP + FN}. \tag{71}$$
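To make these definitions concrete, the following NumPy sketch (not thesis code) computes pixel accuracy, Dice and Jaccard for a pair of binary masks directly from the four basic counts:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel accuracy, Dice and IoU for binary masks (Equations 60, 68, 69)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    pa = (tp + tn) / (tp + tn + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return pa, dice, iou

# Toy example: a 2x2 ground-truth square and a slightly wider prediction.
gt = np.zeros((4, 4), dtype=int); gt[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:4] = 1
print(segmentation_metrics(pred, gt))  # PA = 0.875, Dice = 0.8, IoU ~ 0.667
```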

3. DEEP LEARNING APPROACHES TO IMAGE SEGMENTATION

As shown in the previous chapter, before deep learning took over computer vision, other ML approaches were used for semantic segmentation. In 2012, AlexNet [19] was the first CNN architecture to win the Large Scale Visual Recognition Challenge (ILSVRC-2012), with a TOP-5 test accuracy of 84.6%. The closest competitor, using traditional approaches, achieved only 73.8% on the same dataset. The proposed configuration, depicted in Figure 6, was relatively simple, as it consisted of five convolutional layers, max-pooling, ReLUs as non-linearities, three fully-connected layers, and dropout [20]. AlexNet changed everything and, since then, CNN-based techniques have been considered the state of the art in image segmentation.

Fig. 6: AlexNet CNN architecture.

VGG [21] is a family of architectures presented by the Visual Geometry Group of the University of Oxford as the result of an evaluation of networks of increasing depth. The main difference with respect to previous configurations was the use of a stack of convolutional layers with small 3 × 3 receptive fields in the first layers, instead of a few layers with big receptive fields. This leads to fewer parameters and more non-linearities in between, making the decision function more discriminative and the model easier to train. Pushing the depth to 16–19 weight layers resulted in a significant improvement over prior-art models. VGG-16, the family member composed of 16 weight layers, achieved a TOP-5 test accuracy of 92.7% in ILSVRC-2014.

Fig. 7: VGG-16 architecture.

GoogLeNet [22], proposed by Google Inc., shows how CNN layers can be stacked in more ways than the usual sequential manner. This architecture, composed of 22 layers, introduces the Inception Module, a new building block consisting of different layers computed in parallel: the input is split into a few lower-dimensional embeddings (by 1 × 1 convolutions), transformed by a set of specialized filters (3 × 3, 5 × 5, etc.), and merged by concatenation. This split-transform-merge strategy is expected to approach the representational power of large and dense layers at a lower computational complexity. It can be shown that the solution space of this architecture is a strict subspace of the solution space of a single large layer operating on a high-dimensional embedding. The benefits of this configuration were verified in ILSVRC-2014, where it achieved a TOP-5 test accuracy of 93.3%.

Fig. 8: Inception module with dimension reductions.
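A sketch of such an Inception-style block in PyTorch is shown below (our illustration; the channel widths follow the first Inception module of GoogLeNet but should be treated as an assumption):

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k, **kw):
    """Convolution followed by a ReLU non-linearity."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, **kw),
                         nn.ReLU(inplace=True))

class InceptionBlock(nn.Module):
    """Split-transform-merge: parallel 1x1, 3x3 and 5x5 branches (the last
    two preceded by 1x1 dimension reductions) plus a pooling branch,
    merged by concatenation along the channel dimension."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = conv(in_ch, 64, 1)
        self.b3 = nn.Sequential(conv(in_ch, 96, 1), conv(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(conv(in_ch, 16, 1), conv(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                conv(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```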

These initial deep learning approaches were based on patch classification, where each pixel was classified independently using a patch of the image around it. These early segmentation networks, derived from those used in classification problems, had two main drawbacks when applied to dense prediction problems.

1. Fully connected layers, included in the final stages of classification architectures, involve a much higher number of parameters, which makes the learning process much more computationally expensive. Besides, this kind of layer only accepts fixed-size images.

2. Pooling layers increase the field of view and are able to aggregate context, at the expense of discarding the 'where' information. However, semantic segmentation requires the exact alignment of class maps and thus needs the 'where' information to be preserved.

A summary of the different approaches proposed to overcome these early limitations is presented next.

3.1 Transposed Convolution

Networks designed for classification take a fixed-size input and produce a non-spatial output. When trying to scale them to dense prediction tasks, fully connected layers make the learning process much more computationally expensive, as they include a much higher number of parameters. In [23], a work conducted at UC Berkeley, the authors popularized the use of Fully Convolutional Networks for semantic segmentation. It was the first architecture to get rid of the fully connected layers, allowing the network to accept images of any size and accelerating the training process, as the number of parameters to learn is dramatically reduced.

Even though it is sometimes referred to as deconvolution, transposed convolution is not the inverse of the convolution operation; it does, however, recover the spatial dimensions, so it can be applied in segmentation to restore feature maps to their original size. Contrary to convolutional layers, transposed convolution associates a single input activation with multiple outputs. As upsampling with factor f is equivalent to a convolution with a fractional input stride of 1/f, transposed convolution constitutes a natural way to upsample feature maps.

In the literature, these "deconvolutional" layers are also known as upconvolution, full convolution, transposed convolution or fractionally-strided convolution.

Fig. 9: Deconvolution operation.

Fully connected layers in classification networks can be viewed as convolutions whose kernels cover their entire input regions. This is equivalent to evaluating the original classification network on overlapping input patches, but it is much more efficient because computation is shared over the overlapping regions of the patches. These deconvolutional filters do not need to be fixed (as in bilinear interpolation); the network can even learn a non-linear upsampling by back-propagation.

Fig. 10: Convolution layers enable a classification network to output a heatmap.

After replacing the fully connected layers of a classification network, feature maps still need to be upsampled because of the pooling operations. Instead of using simple bilinear interpolation, deconvolutional layers can learn this upsampling.
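As a concrete illustration (a sketch in PyTorch; the channel and kernel sizes are our own choices), a transposed convolution with stride f performs a learnable upsampling by a factor of f:

```python
import torch
import torch.nn as nn

# Learnable x2 upsampling: with stride=2, kernel_size=4 and padding=1,
# the output resolution is exactly twice the input resolution, since
# H_out = (H_in - 1) * stride - 2 * padding + kernel_size = 2 * H_in.
up = nn.ConvTranspose2d(in_channels=16, out_channels=16,
                        kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 16, 32, 32)   # one 16-channel 32x32 feature map
print(up(x).shape)               # torch.Size([1, 16, 64, 64])
```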

3.2 Encoder-Decoder architectures

Encoder-Decoder architectures are an evolution of FCNs, composed of a contracting path and an expanding path. The contracting path (encoder) gradually reduces the spatial dimension through pooling layers, while the expanding path (decoder) gradually recovers the object details and the spatial dimension. There are usually shortcut connections from encoder to decoder that help the decoder recover the object details better. A popular network of this class is U-Net [24], developed at the University of Freiburg in 2015, whose architecture is illustrated in Figure 11.

Fig. 11: U-Net architecture (example for 32 × 32 pixels in the lowest resolution).

The U-Net contracting path follows the typical architecture of a convolutional network, comprised of repeated unpadded convolutions, each followed by a ReLU unit, with max-pooling operations for downsampling. At each downsampling step, the number of channels is doubled. The U-Net expansive path consists of successive layers where the pooling operators have been replaced by upsampling operators, which increase the resolution of the output. Each upsampling layer output is passed through an "up-convolution" that halves the number of feature channels. This feature map is then concatenated with the correspondingly cropped feature map from the contracting path. The final layer is a 1 × 1 convolution which maps each feature vector to the desired number of classes.
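A minimal one-level encoder-decoder with a skip connection, sketched in PyTorch (our illustration; padded convolutions are used for simplicity, whereas the original U-Net uses unpadded convolutions and cropping):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class MiniUNet(nn.Module):
    """Contracting path, bottleneck and expanding path with one skip."""
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc = double_conv(in_ch, 64)
        self.pool = nn.MaxPool2d(2)                          # downsampling
        self.mid = double_conv(64, 128)                      # channels doubled
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)   # up-convolution
        self.dec = double_conv(128, 64)                      # 64 skip + 64 upsampled
        self.out = nn.Conv2d(64, n_classes, 1)               # final 1x1 convolution

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.pool(e))
        d = self.dec(torch.cat([self.up(m), e], dim=1))      # skip connection
        return self.out(d)
```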

Another member of this family is SegNet [25], illustrated in Figure 12. Its encoder network is composed of 13 convolutional layers identical to the first 13 layers of VGG-16 [21]. The decoder consists of a hierarchy of 13 decoders which use the max-pooling indices to perform non-linear upsampling, as shown in Section 3.3. SegNet also introduces more shortcut connections than previous FCN designs.

Fig. 12: SegNet architecture is fully convolutional.

Despite their good results, translation invariance is often compromised in these architectures. Due to the pooling operations in the contracting path, feature maps reach the expanding path with a relatively low resolution and need to be complemented via skip connections from encoders to decoders. On the other hand, since convolution is a local operation, a model with no pooling layers would not be able to learn holistic features of the images. Different approaches have been proposed to get rid of this trade-off between spatial resolution and expansion of the receptive field.

3.3 Unpooling

Pooling layers in CNNs are responsible for abstracting the activations in a receptive field into a single representative value. During this process spatial information is lost, which may be critical for semantic segmentation. To recover spatial resolution, the upsampling stage usually employs bilinear interpolation due to its computational efficiency and good recovery of the original image. Deconvolutional networks [26] instead employ unpooling layers, which perform the reverse operation to reconstruct the original size of the activations: the locations of the maxima selected during the pooling operation are recorded and employed later to place each activation back in its original location.

Fig. 13: Unpooling operation.

The output of this operation is enlarged in resolution, but still sparse. To build a dense pixel-wise class prediction map, it needs to be densified in a deconvolutional layer through convolution-like operations with multiple learned filters.

Fig. 14: Architecture of the proposed deconvolutional network.

Similar to the convolutional network, a hierarchical structure of deconvolutional layers captures different levels of shape detail. While filters in the lower layers capture the overall shape of an object, those in the higher layers encode the finer details. In this way, these networks also rely on class-specific shape information, which is ignored in configurations based only on convolutional layers.

Fig. 15: SegNet decoder upsampling compared to FCN. (a) SegNet. (b) FCN.
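Index-based unpooling is directly available in PyTorch, whose pooling layers can return the argmax locations (a minimal sketch, with arbitrary sizes):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # record argmax locations
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, indices = pool(x)         # 4x4 -> 2x2, plus positions of the maxima
sparse = unpool(pooled, indices)  # 2x2 -> 4x4: maxima restored, zeros elsewhere
print(sparse.shape)               # torch.Size([1, 1, 4, 4])
```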

SegNet [25] follows the same principle, using the max-pooling indices memorized in the corresponding encoders to perform non-linear upsampling.

3.4 Strided Convolution

As seen previously, most CNNs used for computer vision tasks were built by stacking blocks composed of convolutional layers and max-pooling layers for downsampling, followed by a non-linear activation function. Reduction of the spatial dimension can also be achieved by convolving the input feature map with a stride greater than one, allowing the network to learn the weights of the downsampling stage instead of imposing a fixed pooling rule (maximum, minimum, average, etc.).

Let F be the filter size, S the stride with which we slide the filter, and P the amount of zero-padding on the borders of the image. The spatial size of the output volume can then be expressed as a function of the input volume size $W_{in}$:

$W_{out} = \frac{W_{in} - F + 2P}{S} + 1$ . (72)

When the stride is larger than 1 (2 or, uncommonly, 3 or more), the convolution operation produces spatially smaller output volumes. This convolutional subsampling scheme usually works better for generative models, and also for dense prediction tasks such as segmentation.

Fig. 16: Convolution operation with filter size 3 and stride 2.

The architecture proposed in [27] consists solely of convolutional layers, as no pooling layers of any sort are used to reduce the spatial size. The experiments conducted show that this network matches or even slightly outperforms the state of the art on the CIFAR-10 and CIFAR-100 datasets.
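Equation 72 can be verified with a strided convolution in PyTorch (a sketch with arbitrary sizes):

```python
import torch
import torch.nn as nn

# Downsampling by convolution instead of pooling: with W_in = 32, F = 3,
# P = 1 and S = 2, Eq. 72 (with integer division) gives
# W_out = (32 - 3 + 2) // 2 + 1 = 16.
down = nn.Conv2d(in_channels=16, out_channels=32,
                 kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 16, 32, 32)
print(down(x).shape)             # torch.Size([1, 32, 16, 16])
```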

3.5 Dilated Convolution

As seen previously, image classification networks integrate multi-scale contextual information via successive pooling layers that increase the receptive field while reducing resolution, until a global prediction is obtained. In contrast, dense prediction tasks call for multi-scale contextual reasoning in combination with full-resolution output. Dilated convolution [28] is an operator particularly suited to this task due to its ability to expand the receptive field without decreasing the spatial dimensions, while the number of parameters to learn grows only linearly with the number of stacked layers. Let

• $F : \mathbb{Z}^2 \rightarrow \mathbb{R}$ be a discrete function,

• $\Omega_r = [-r, r]^2 \cap \mathbb{Z}^2$,

• $k : \Omega_r \rightarrow \mathbb{R}$ be a discrete filter of size $(2r + 1)^2$.

The discrete convolution operator with dilation factor $l$ is defined as:

$(F *_l k)(p) = \sum_{s + lt = p} F(s) k(t)$ . (73)

Fig. 17: Systematic dilation. (a) 1-dilated convolution. (b) 2-dilated convolution. (c) 4-dilated convolution.

In simple terms, a dilated convolution is just a convolution applied to the input with defined gaps. Applying dilation systematically, as illustrated in Figure 17, supports an exponential expansion of the receptive field without loss of resolution or coverage, matching the needs of semantic segmentation.
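The following sketch (ours, in PyTorch) stacks 3 × 3 convolutions with dilations 1, 2 and 4; the receptive field grows to 15 × 15 pixels while the spatial resolution is fully preserved:

```python
import torch
import torch.nn as nn

# Padding equal to the dilation keeps the feature map size constant,
# so the receptive field (3 -> 7 -> 15 pixels) grows without downsampling.
net = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
    nn.Conv2d(8, 8, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
    nn.Conv2d(8, 8, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
)

x = torch.randn(1, 1, 64, 64)
print(net(x).shape)              # torch.Size([1, 8, 64, 64]) -- full resolution
```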

3.6 Large Kernel Size

A well-designed semantic segmentation model has to perform simultaneously two tasks that are naturally contradictory: classification and localization. While these models are required to be invariant to transformations like translation and rotation for the classification task, they also need to be sensitive to them in order to precisely locate every pixel. Previous semantic segmentation algorithms mainly follow design principles for localization, which may be sub-optimal for classification. Global Convolutional Network [29] is a novel architecture designed to address both issues. This fully convolutional network adopts a large kernel size to enable dense connections between feature maps and per-pixel classifiers, enhancing the capability to handle different transformations. The experiments conducted show that performance on dense prediction tasks increases with the kernel size.

Fig. 18: GCN addresses both the classification and localization issues. (a) Classification. (b) Localization. (c) GCN.

As larger kernel sizes are computationally expensive, a k × k convolution is approximated by the sum of two separable branches: a 1 × k convolution followed by a k × 1 one, and a k × 1 convolution followed by a 1 × k one. This configuration enables dense connections within a large k × k region of the feature map, while the GCN structure involves only O(2/k) of the parameters of the trivial k × k convolution, which makes it practical for large kernel sizes. No non-linearity is applied after these convolution layers.

Fig. 19: Details of the GCN module.
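A sketch of this separable large-kernel module (ours, in PyTorch; the kernel size and channel counts are illustrative):

```python
import torch
import torch.nn as nn

class GCNModule(nn.Module):
    """Large k x k convolution approximated by two separable branches:
    (1 x k then k x 1) + (k x 1 then 1 x k), with no non-linearity."""
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        p = k // 2                          # preserves spatial size for odd k
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))

    def forward(self, x):
        return self.left(x) + self.right(x)
```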

To improve the localization ability near object boundaries, a Boundary Refinement module is introduced. This is basically a residual block that models the boundary alignment as a residual structure. While the GCN architecture mainly improves the internal regions, Boundary Refinement increases the performance near the boundaries. In 2017, GCN outperformed the previous state-of-the-art results, achieving 82.2% on the PASCAL VOC 2012 dataset and 76.9% on the Cityscapes dataset.

3.7 Residual Networks

As described in previous sections, leading results on semantic segmentation and other computer vision tasks exploit very deep networks. Unfortunately, building these models brings new challenges aside from vanishing or exploding gradients. When training very deep configurations, accuracy tends to saturate and then degrades rapidly as the network depth increases. Such degradation is not caused by overfitting, as adding more layers to a suitably deep model equally leads to a higher training error.

This degradation problem can be addressed by introducing a deep residual learning framework. In this approach, the network configuration lets each stack of a few layers fit a residual mapping, instead of hoping that they directly fit a desired underlying mapping. Denoting the desired underlying mapping as H(x), the stacked non-linear layers fit another mapping F(x) := H(x) − x. This formulation can be realized by feedforward neural networks with shortcut connections which simply perform an identity mapping, their outputs being added to the outputs of the stacked layers. This way, identity shortcut connections add neither extra parameters nor computational complexity.

Fig. 20: Residual learning building block.

Deep residual nets are the foundations of ResNet [30], a work from Microsoft Research and winner of the ILSVRC-2015 classification task with a 3.57% error on the ImageNet test set.
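The building block of Figure 20 translates almost literally into code (our sketch, in PyTorch):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual building block: the stacked layers learn F(x) = H(x) - x
    and the parameter-free identity shortcut adds x back."""
    def __init__(self, ch):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # H(x) = F(x) + x
```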

In [31], two pure Inception variants, Inception-v3 and Inception-v4, are compared with similarly expensive hybrid Inception-ResNet versions, providing empirical evidence that residual connections accelerate training. The residual variants started to exhibit instabilities when the number of filters exceeded 1000, as the last layer before the average pooling began to produce only zeroes after a few iterations. This kept happening even after adding an extra batch normalization to this layer, but it was found that scaling down the residuals (by a factor between 0.1 and 0.3) before adding them to the activations of the previous layer stabilized the training process.

ResNeXt [32], an architecture proposed by UC San Diego and Facebook AI Research, combines residual connections with the split-transform-merge strategy of GoogLeNet [22]. This network is built by repeating a new building block, illustrated in Figure 21, which aggregates a set of transformations with the same topology. The size of this set, named "cardinality" by the authors, constitutes an essential dimension, and the experiments demonstrate that increasing the cardinality is a more effective way of gaining accuracy than going deeper or wider. By following this strategy, ResNeXt secured the second place in the ILSVRC-2016 classification task, achieving a 3.03% TOP-5 error rate.

Fig. 21: Building blocks for different residual networks. (a) ResNet. (b) ResNeXt (cardinality = 32).

Dilated Residual Networks [33] combine ResNets with dilated convolutions to preserve the spatial resolution of convolutional networks originally designed for image classification. This allows dense pixel-level class activation maps to be produced directly, as discussed in Section 3.5. Therefore, a DRN trained for image classification can be immediately used for object localization and segmentation.
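The aggregated transformations of the ResNeXt block can be written compactly as a grouped convolution, an equivalence shown in the original paper; the sketch below (ours, in PyTorch) uses the bottleneck widths of Figure 21(b) as an assumption:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck: 32 same-topology transformations expressed as a
    single 3x3 grouped convolution between two 1x1 projections."""
    def __init__(self, ch=256, width=128, cardinality=32):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(ch, width, 1), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, ch, 1), nn.BatchNorm2d(ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # residual connection
```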
