KDD, as defined in [66], is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. A knowledge discovery framework is also proposed in [66], consisting of domain knowledge, a database, discovery processes (methods, searches and evaluation) and a user interface to communicate the discovered knowledge. In [56] the above definition of KDD was refined as a process that involves: using a database together with any required selection, pre-processing, subsampling and transformations; applying Data Mining (DM) methods (algorithms) to identify patterns within it; and evaluating the result of the DM to identify the subset of the enumerated patterns deemed to describe new knowledge.
Advances in secondary¹ and tertiary² storage capacity have resulted in much more data being gathered. The size of a repository can reach terabytes, with a variety of data formats ranging from structured data, such as numerical data, to more complicated types such as multimedia data. There is considerable interest in mining this data to discover new knowledge.
This section presents an overview of KDD, commencing in Sub-section 3.3.1 with the general KDD process. Sub-section 3.3.2 then considers the DM sub-process of KDD in further detail. Some current general issues concerning KDD are then discussed in Sub-section 3.3.3.
3.3.1 The Knowledge Discovery in Databases Process
Many KDD process models have been proposed to provide guidance for KDD practitioners. The first model was proposed in [56] and was subsequently improved or modified by others. The author of [40] compared a number of models originating from within academia and industry. Most models have a similar sequence of steps: (i) selecting and understanding the application domain, (ii) understanding the data, (iii) preparation of data, (iv) DM, (v) post-processing of the discovered knowledge and (vi) deployment of results. It is worth noting that this process is iterative and time consuming, with many loops. Figure 3.4 shows the functional steps in the KDD process as suggested in [56, 130]. With reference to this model, each step is described in more detail below according to the descriptions presented in [24, 56, 130]:
1. Selecting and understanding the application domain. Learn the relevant prior knowledge or business objectives and requirements in order to understand the goals of the end user of the discovered knowledge. The output of this stage is the end goals expected by end users.
2. Data selection. Select the appropriate subset of data according to the identified end user goals.
3. Data pre-processing. Improve the quality of the data using basic operations, including noise removal and the handling of missing values.
4. Data transformation. Recast the input data into a form appropriate for the application of DM. This can be achieved through several operations such as feature extraction, selection and attribute transformation. The outcome of this stage is a set of feature vectors extracted from the pre-processed data in a form that allows DM techniques to be directly applied.
¹ Secondary storage is storage other than the computer's main memory (e.g. hard disks and flash drives).
² Tertiary storage is a third level of storage used for mass storage and archiving of data. The data access
[Figure 3.4 diagram: Data → (Data selection) → Selected data → (Data preprocessing) → Preprocessed data → (Data transformation) → Transformed data → (Data mining) → Pattern → (Interpretation/visualisation) → Knowledge]
Figure 3.4: KDD process functional steps [56, 130]
5. Data mining (DM). Apply DM methods and/or algorithms to the identified features to discover patterns of interest. Examples of methods that may be used for pattern extraction include neural networks, clustering and rule generation. Examples of the sorts of patterns that may be identified include association rules and decision trees.
6. Interpretation and visualisation. Analyse the discovered knowledge in terms of “interestingness”, and verify the discovered patterns with the help of domain experts and, possibly, visualisation tools. As a consequence it may be necessary to return to any one of steps 1 to 5.
7. Put the discovered knowledge into use. Incorporate the newly discovered knowledge into the existing domain (and document it).
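The data-centred steps above (2 to 5) can be sketched as a chain of small functions. This is only an illustrative sketch under stated assumptions: the function names, the mean-imputation rule, the min-max scaling and the threshold-based stand-in for a DM method are all hypothetical choices and are not taken from [24, 56, 130].

```python
from statistics import mean

def select(records, fields):
    """Step 2: keep only the fields relevant to the end-user goal."""
    return [{f: r.get(f) for f in fields} for r in records]

def preprocess(records):
    """Step 3: replace missing values (None) with the mean of the observed values."""
    fields = list(records[0].keys())
    means = {f: mean(r[f] for r in records if r[f] is not None) for f in fields}
    return [{f: (r[f] if r[f] is not None else means[f]) for f in fields}
            for r in records]

def transform(records, fields):
    """Step 4: recast each record as a feature vector scaled to [0, 1]."""
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}
    return [[(r[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
             for f in fields]
            for r in records]

def mine(vectors, threshold=0.5):
    """Step 5: hypothetical stand-in for a DM method; returns the indices of
    vectors whose mean feature value exceeds the threshold."""
    return [i for i, v in enumerate(vectors) if mean(v) > threshold]

def kdd(records, fields):
    """Run steps 2 to 5 in sequence."""
    cleaned = preprocess(select(records, fields))
    return mine(transform(cleaned, fields))

# Toy data (invented for illustration); one value is missing.
records = [
    {"id": 1, "height": 150, "weight": 50},
    {"id": 2, "height": 180, "weight": None},
    {"id": 3, "height": 165, "weight": 90},
]
patterns = kdd(records, ["height", "weight"])
```

In practice the output of the final stage would be inspected (step 6) and, where it proves uninteresting, earlier stages would be revisited with different selections or transformations, which is the iterative loop noted above.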
3.3.2 Data Mining
Data Mining (DM) is a generic term used to describe processes for extracting knowledge from large amounts of data. Some authors consider DM to be synonymous with KDD, while others see it as a step in the KDD process [56, 83]. In this thesis, DM is viewed as an essential step within the overall KDD process (see Figure 3.4). DM may be applied to different domains, such as business, medicine and telecommunications; to single or multiple databases; and with different goals. Thus, different types of DM technique have been identified to reflect the nature of their domains of application, hence web mining, multimedia mining, graph mining and so on. The work described in this thesis is focused on image mining. The earliest work on object recognition in image databases included SKICAT and JARtool [57]. However, the term “image mining” was first introduced in [147], where a DM algorithm was presented to find association rules according to image content. In the remainder of this thesis, the term image mining is used to represent the application of DM techniques to image data.
From the literature a number of different DM objectives can be identified, the most common of which are frequent pattern mining, clustering and classification. Frequent pattern mining is directed at the identification of patterns that occur frequently across a data set [83]. It plays an important role in the discovery of interesting relationships between data [82], especially in Association Rule Mining (ARM).
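As a minimal illustration of frequent pattern mining, the sketch below counts how often each itemset occurs across a set of transactions and keeps those that meet a minimum support count. It is a brute-force enumeration rather than any specific algorithm from [82, 83]; the function names, the `max_size` limit and the toy transactions are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support count} for itemsets (sorted tuples) of up to
    max_size items that occur in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

def confidence(supports, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, computed from supports."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return supports[both] / supports[tuple(sorted(antecedent))]

# Toy transactions (invented for illustration).
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
frequent = frequent_itemsets(transactions, min_support=2)
rule_confidence = confidence(frequent, ("bread",), ("milk",))
```

Here the pair (bread, milk) occurs in two of the three transactions that contain bread, so the rule bread → milk has a confidence of 2/3. Exhaustive enumeration of this kind is only practical for small item counts.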
Clustering is concerned with the grouping of data into “clusters” so as to maximise the similarity between data within a cluster, while at the same time minimising the similarity between data in different clusters. Examples of learning algorithms that perform clustering include k-Means, where data is assigned to one of the k clusters, and DBSCAN [53] where the number of clusters is not pre-specified.
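The k-Means idea can be sketched in a few lines: alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members. The sketch below makes simplifying assumptions that are not taken from the source: centroids are seeded with the first k points, a fixed number of iterations is used, and the data points are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Basic k-Means; returns (labels, centroids)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two visually separate groups of 2-D points (invented for illustration).
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
```

Note that k must be supplied in advance, which is the contrast drawn above with DBSCAN, and that no class labels are needed at any point.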
Classification is directed at generating a representative model of the given data, called a classifier, that can be used to assign class labels to new data. As such, classifier generation (learning) requires training data that has class labels associated with it. It is interesting to note that cluster definitions can also be used for classification purposes; however, clustering does not require the provision of pre-labelled training data. Classification is therefore sometimes referred to as supervised learning, while clustering is sometimes referred to as unsupervised learning. A great many classification algorithms have been proposed using many different techniques, including artificial neural networks and decision trees. The work described in this thesis is directed at classification, particularly image classification, and this is therefore discussed further in Section 3.4.
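The supervised setting can be illustrated with a very simple classifier: a nearest-centroid model that learns one centroid per class from labelled training data and assigns a new item to the class of the closest centroid. This technique is chosen here only because it is short; it is not the approach used in this thesis, and the function names and data are hypothetical.

```python
from collections import defaultdict

def train(samples, labels):
    """Learn one centroid per class from labelled training data."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    return {y: [sum(col) / len(xs) for col in zip(*xs)]
            for y, xs in grouped.items()}

def predict(model, x):
    """Assign x the label of the nearest class centroid."""
    return min(model,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

def accuracy(model, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)

# Labelled training set and a separate labelled test set (invented data).
train_x = [(1.0, 1.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
train_y = ["A", "A", "B", "B"]
test_x = [(1.5, 1.5), (8.0, 8.0)]
test_y = ["A", "B"]

model = train(train_x, train_y)
test_accuracy = accuracy(model, test_x, test_y)
```

Unlike the clustering case, the class labels in `train_y` are required to build the model, and a held-out labelled test set is used to evaluate it.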
3.3.3 Issues in Knowledge Discovery in Databases
Many KDD issues have been explored and discussed by other researchers. The input to a knowledge discovery system is some data repository, either a conventional relational database or some alternative, less conventional form of data such as text, graphs or images. The quality of the output from any KDD process is dependent on the quality of the input. Real world databases are usually dynamic, and may consist of hundreds of fields and tables and large numbers of records, but are likely to be incomplete and to contain noise and errors. Most importantly, the data used for KDD should be as accurate and as cohesive as possible. Listed below are the most significant KDD issues with respect to the work described in this thesis [24, 56, 66, 83]:
1. Data quality. In real world KDD scenarios, integration of distributed databases is a common requirement. Matching records across different databases to form a single record poses a serious concern. Data verification and validation have to be conducted to ensure that these records are merged correctly. With respect to image data, the image acquisition process may affect the colour variation of the image due to factors such as lighting and/or the subject’s movement. A pre-processing task to remove such variations can be used to (at least partially) solve this problem.
2. Noise and errors. Noise is a random error or variance in a measured variable [83]. It is commonly caused by attribute values that are apparently random. Examples of noise, with respect to image data, include common objects that exist in different classes, such that removing the objects from the images would not affect the classification performance. As for errors, which might be caused by internal or external factors, the simplest solution is to filter them out.
3. Relative values. With respect to image datasets (the focus of the work described in this thesis) no absolute values can be provided, because different images may produce the same value for an attribute but with different meanings; the value is dependent on the context of the image. For example, a grey-scale value of 50 may appear darker than a grey-scale value of 70 if the surrounding contexts are all very bright [97].
4. High dimensionality. Today multi-gigabyte databases are commonplace. These databases consist of large numbers of records, fields and attributes. The problem with high dimensionality in databases is that the search space grows exponentially with the number of attributes [56]. This may affect KDD performance in terms of time. There is also the possibility that irrelevant or invalid patterns are discovered by the DM process. One solution to this issue is to apply dimensionality reduction methods that use prior knowledge to filter out irrelevant attributes [56].
5. Knowledge filtering. Overfitting is a common problem in the context of classification. It happens when a classification algorithm generates a model from a limited set of data such that the model is too precisely fitted to that data [56]. This problem can be at least partially resolved by adopting pruning and statistical strategies.
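The exponential growth mentioned under high dimensionality (issue 4) can be made concrete by counting attribute subsets: with n attributes there are 2^n − 1 non-empty subsets that a pattern search might, in the worst case, have to consider. The sketch below takes that count as one simple reading of "search space" and shows a prior-knowledge filter that keeps only attributes named as relevant; the function names, attribute names and the relevant set are hypothetical.

```python
def candidate_subsets(n_attributes):
    """Number of non-empty attribute subsets for n attributes: 2^n - 1."""
    return 2 ** n_attributes - 1

def drop_irrelevant(records, relevant):
    """Keep only the attributes that prior knowledge marks as relevant."""
    return [{k: v for k, v in r.items() if k in relevant} for r in records]

# Growth of the subset count as attributes are added.
growth = {n: candidate_subsets(n) for n in (5, 10, 20)}

# Hypothetical records and a hypothetical set of relevant attributes.
records = [
    {"mean_intensity": 0.42, "edge_density": 0.10, "file_size": 20480, "scan_id": 7},
    {"mean_intensity": 0.55, "edge_density": 0.31, "file_size": 19872, "scan_id": 8},
]
reduced = drop_irrelevant(records, relevant={"mean_intensity", "edge_density"})
```

Reducing four attributes to two in this toy case cuts the number of non-empty subsets from 15 to 3; the same proportional effect is far larger at realistic attribute counts.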
With respect to the work described in this thesis, a number of measures have been taken to address the issues listed above. Measures to reduce the negative effect of low data quality, noise and relative values are described in detail in Chapter 4 (Image pre-processing). Various actions to counter the effect of high dimensionality and knowledge filtering are presented in Chapters 5, 6 and 7.
[Figure 3.5 diagram. Process stages: Image acquisition, Image pre-processing, Feature extraction & selection, Classifier generation (Learning step), Image classification (Evaluation step). Data items: Image dataset, Enhanced images, Features, Training set, Test set, Classifier]
Figure 3.5: Image classifier generation process [9]