Anomaly Detection Preliminaries - Universidad Católica San Pablo: Unsupervised anomaly detectio

CHAPTER 2. Background

Digitally Reconstructed Radiographs.- Digitally Reconstructed Radiographs (DRRs) are images computed from CT data. These radiographs play an important role as reference images for image-guided therapy and for 2D-3D image registration. Note that the Beer-Lambert law, on which DRR generation is based, is formulated for monochromatic radiation, and that absorption increases as the radiation wavelength decreases. According to Sherouse et al. (1990), a DRR implementation has three main features:

(1) methods for interslice interpolation; (2) a method for approximating photoelectric and Compton linear attenuation coefficients from Hounsfield units; and (3) selectable pixel size and “film size” of the computed image. Additionally, DRRs have to consider other relevant factors: (a) projection of anatomic contours extracted from CT scans; (b) projection of collimator edges, custom blocks, and crosshairs; and (c) the ability to produce images with an arbitrary ratio of Compton to photoelectric interactions.
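As a rough illustration of factor (2) and of the projection step, the sketch below converts Hounsfield units to linear attenuation coefficients using the standard definition HU = 1000 (μ - μ_water) / μ_water, and then attenuates parallel rays through a toy volume. It is a minimal sketch under stated assumptions, not the implementation of Sherouse et al.: the parallel-beam geometry, the single effective value chosen for μ_water, the function names, and the toy volume are all hypothetical, and interslice interpolation and the separate photoelectric and Compton terms are omitted.

```python
import math

# Assumed effective linear attenuation coefficient of water in 1/cm
# (illustrative value only; it depends on the beam energy).
MU_WATER = 0.2

def hu_to_mu(hu, mu_water=MU_WATER):
    """Invert HU = 1000 * (mu - mu_water) / mu_water, clamping at zero."""
    return max(0.0, mu_water * (1.0 + hu / 1000.0))

def parallel_drr(volume, voxel_size_cm, i0=1.0):
    """Project volume[z][y][x] (in HU) along the y axis with parallel rays.

    Returns image[z][x], the transmitted intensity for each ray.
    """
    image = []
    for ct_slice in volume:
        n_x = len(ct_slice[0])
        row = []
        for x in range(n_x):
            # Sum of attenuation coefficients along the ray times the step length
            path = sum(hu_to_mu(line[x]) for line in ct_slice) * voxel_size_cm
            row.append(i0 * math.exp(-path))
        image.append(row)
    return image

# Toy volume: one slice, three rows (y) by two columns (x).
# Column 0 is air (-1000 HU), column 1 is water (0 HU).
toy = [[[-1000, 0], [-1000, 0], [-1000, 0]]]
drr = parallel_drr(toy, voxel_size_cm=1.0)
```

In this toy case the ray through air is not attenuated, while the ray through 3 cm of water is attenuated by exp(-0.6) under the assumed μ_water.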

The Beer-Lambert law is a fundamental principle in DRR generation. According to this law, the absorption of radiation depends on: (1) the intensity of the incident beam; (2) the path length; (3) the concentration of the absorbing species; and (4) the extinction coefficient. The absorption of radiation is computed with Equation 2.1:

A = εlc    (2.1)

where:

• A is the absorbance

• ε is the molar attenuation coefficient or absorptivity of the attenuating species

• l is the optical path length in cm

• c is the concentration of the attenuating species
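Equation 2.1 can be evaluated directly, as in the minimal sketch below. The function names and the numeric values are hypothetical. The conversion from absorbance to transmittance uses the standard decadic definition A = -log10(I / I0), which is not stated in the text above and is included here as an assumption.

```python
def absorbance(epsilon, path_length_cm, concentration):
    """Equation 2.1: A = epsilon * l * c."""
    return epsilon * path_length_cm * concentration

def transmittance(a):
    """Fraction of the incident intensity that is transmitted, T = 10 ** (-A)."""
    return 10.0 ** (-a)

# Hypothetical values: epsilon in L/(mol*cm), l in cm, c in mol/L
a = absorbance(epsilon=100.0, path_length_cm=1.0, concentration=0.01)
t = transmittance(a)
```

With these values the absorbance is 1, so one tenth of the incident intensity is transmitted. Doubling either the path length or the concentration doubles A.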

2.2. Anomaly Detection Preliminaries

2.2.1 Anomaly Detection Problem

There are a number of challenges that make the anomaly detection problem difficult. To begin with, the boundary between normal and anomalous behaviour is often imprecise. Also, in certain domains such as intrusion detection, normal behaviour evolves constantly, so that legitimate changes might be mistakenly identified as outliers. In addition, anomaly detection techniques need to be adapted to each application domain. Moreover, the scarcity of labelled data for training and validation limits the results and conclusions that can be reached.

In anomaly detection, depending on the domain, several important points must be considered, including the input data, the type of anomalies, the availability of data labels, and the anomaly detection output (Chandola et al., 2009). The nature of the input data is one of the essential features of any anomaly detection process: how is the data represented, and how are the data types of these representations determined? An input is a collection of data instances or observations, each of which can be described using a set of features (attributes). The features can be of binary, categorical, or continuous type. Binary features take one of two possible values. Categorical features take one of a finite set of possible values. For instance, a gender feature may be categorical, with the set of values male and female. By contrast, continuous features take values in a continuous range.
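The three feature types can be made concrete with a small encoding sketch. The feature names (`has_implant`, `view`, `age_years`) and the category list are hypothetical and serve only as an illustration; one-hot encoding is one common way of turning a categorical feature into numbers, not a choice made in the text above.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode(instance, views):
    """Map one instance to a numeric vector: binary, categorical, continuous."""
    return (
        [float(instance["has_implant"])]      # binary feature: two possible values
        + one_hot(instance["view"], views)    # categorical feature: finite set of values
        + [instance["age_years"]]             # continuous feature: a real number
    )

VIEWS = ["AP", "lateral", "oblique"]  # assumed category set
instances = [
    {"has_implant": True, "view": "AP", "age_years": 63.5},
    {"has_implant": False, "view": "lateral", "age_years": 41.0},
]
vectors = [encode(inst, VIEWS) for inst in instances]
```

Each instance becomes a vector of five numbers: one for the binary feature, three for the categorical feature, and one for the continuous feature.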

In this regard, it is essential to review some theory regarding anomalies. Even though an anomaly is defined in different ways depending on the application, one widely accepted definition was proposed by Hawkins (1980):

Anomaly.- An anomaly is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.

Anomalies can be the result of errors in the data, but sometimes they indicate a new, previously unknown underlying process. These anomalies reveal behaviour patterns in the data and convey valuable information that is considered vital in several decision-making systems (Chalapathy and Chawla, 2019). Another important aspect is the type of anomaly under consideration.

Anomaly Types.- Depending on the nature of the anomaly, it can be broadly classified into three categories: point anomalies, contextual anomalies, and collective anomalies. Figure 2.5 presents the different types of anomalies.

Point anomalies occur when single data records deviate from the remainder of the dataset. Contextual anomalies occur when a record has behavioural as well as contextual attributes: the same behavioural attributes could be considered normal in one context and anomalous in another. Finally, collective anomalies refer to a group of similar data that deviate from the remainder of the dataset. This can only occur in datasets where the records are related to each other. Moreover, contextual anomalies can be converted into point anomalies by aggregating over the context (Hodge and Austin, 2004; Gogoi et al., 2011).

Figure 2.5: Anomaly types. A) Point anomaly: represented by A₁, A₂, A₃; these points lie outside the normal clusters N₁ and N₂. B) Contextual anomaly: represented by the low temperature value. This pattern is anomalous because it differs from the periodic context. C) Collective anomaly: represented by the horizontal pattern halfway along the graph. This pattern is anomalous when compared to previous normal patterns. Source: Image reproduced from (Araya, 2016).
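The difference between a point anomaly and a contextual anomaly, and the idea of aggregating over the context, can be sketched as follows. The temperature series is invented for illustration. Using a z-score as the point-anomaly criterion and the per-month median as the context aggregate are assumptions of this sketch, not methods prescribed by the cited works.

```python
from statistics import mean, median, pstdev

def z_scores(values):
    """Standard score of each value relative to the whole series."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical monthly temperatures (deg C) for three years;
# index % 12 is the calendar month, which plays the role of the context.
one_year = [2, 4, 9, 14, 19, 23, 25, 24, 20, 14, 8, 3]
temps = [float(t) for t in one_year * 3]
temps[18] = 5.0  # a July reading: ordinary for winter, unusual for July

# Treated as a point anomaly problem, the value does not stand out.
global_z = z_scores(temps)

# Aggregate over the context: subtract the median of the same calendar month,
# then treat the residuals as a point anomaly problem.
month_median = [median(temps[m::12]) for m in range(12)]
residuals = [t - month_median[i % 12] for i, t in enumerate(temps)]
context_z = z_scores(residuals)

suspect = max(range(len(temps)), key=lambda i: abs(context_z[i]))
```

The low July reading lies well within the global range of the series, so its global z-score is small. Once the monthly context has been removed, it becomes the observation with the largest deviation.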

Master Program in Computer Science - UCSP 17

Finally, taking into account the anomaly definition and the anomaly types, the anomaly detection task can be understood as follows:

Anomaly Detection.- A data analysis task that aims to detect anomalous or abnormal data in a given dataset. It finds patterns in data that were previously absent or overlooked, with the aim of providing valuable information that supports the decision-making process (Chandola et al., 2009; Ahmed et al., 2016).

2.2.2 Anomaly Detection Approaches

The availability of data labels is an important factor in addressing the anomaly detection problem. It refers to whether each observation is labelled as either normal or abnormal. Depending on the availability of data labels, the anomaly detection system can use supervised, semi-supervised, or unsupervised anomaly detection techniques (Hodge and Austin, 2004).

Supervised Anomaly Detection.- It describes the setup where the data comprises fully labelled training datasets (see Figure 2.6). An ordinary classifier can be trained first and applied afterwards. This scenario is very similar to traditional pattern recognition, with the exception that the classes are typically strongly unbalanced.

However, this setup is of limited practical relevance because it assumes that anomalies are known and labelled correctly. For many applications, anomalies are not known in advance or may occur spontaneously as novelties during the test phase.

Figure 2.6: Supervised anomaly detection illustration.
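To illustrate why strongly unbalanced classes matter in the supervised setup, the sketch below evaluates a trivial classifier that always predicts "normal". The class proportions (990 normal observations, 10 anomalies) and the helper names are assumptions chosen for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true anomalies (label `positive`) that were detected."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0

# Hypothetical labelled dataset: 0 = normal, 1 = anomaly
y_true = [0] * 990 + [1] * 10
y_pred = [0] * len(y_true)  # baseline that never reports an anomaly

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
```

The baseline reaches an accuracy of 0.99 while detecting none of the anomalies, which is why accuracy alone is a poor measure when the classes are unbalanced.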

Semi-supervised Anomaly Detection.- It assumes the existence of a small amount of labelled data together with a large amount of unlabelled data during the training stage (see Figure 2.7). The basic idea is that a model learns to distinguish normal from abnormal data by taking into account the features of the labelled dataset.


Figure 2.7: Semi-supervised anomaly detection illustration.

Unsupervised Anomaly Detection.- It is the most flexible setup, since it does not require any labels (see Figure 2.8). The idea is that an unsupervised anomaly detection algorithm scores the data solely on the basis of intrinsic properties of the dataset. Typically, distances or densities are used to estimate what is normal and what is an outlier.

Figure 2.8: Unsupervised anomaly detection illustration.
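A distance-based score of the kind mentioned above can be sketched in a few lines. The scoring rule used here (mean Euclidean distance to the k nearest neighbours), the value of k, and the toy points are assumptions made for illustration; this is one common choice among many and is not the method used later in this work.

```python
import math

def knn_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours.

    No labels are used: a larger score means the point is farther from the rest.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy data: four points forming a tight cluster and one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
```

The clustered points all receive a score of about 0.1, while the isolated point receives a much larger score and is therefore ranked as the most anomalous observation.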

Labelling each observation in a biomedical dataset is a difficult and time-consuming process. Moreover, the dynamic nature of anomalies makes it difficult to label such a dataset. For instance, in the case of foreign objects in the pelvic bone, the shape and appearance of these anomalies vary with several factors, such as the silhouette, material, and location of the object (see Figure 2.9). In this regard, the labelling task is challenging and in many cases infeasible, since accurate labelling demands the expertise of clinicians or radiologists.

The anomaly detection output plays an important role in classifying an observation as normal or not. Normally, the output can be of two types: labels and scores. Labels indicate whether an instance is an anomaly or not, and they are a common output of supervised learning. On the other hand, semi-supervised and unsupervised learning output scores, which are confidence values indicating the degree of abnormality (Goldstein and Uchida, 2016). Finally, since our dataset does not have any kind of labels, we focus on unsupervised learning to address the anomaly detection problem. This section covers most of the required background. However, it does not cover unsupervised learning methods in detail; these are addressed thoroughly in the next section.

