

The task of classification is to learn a function that maps data objects to their correct class(es) in a predefined class set. A classifier learns from a so-called training set, which contains a sufficient number of already mapped objects for each class. The training objects are considered to be "labelled" with the name of the class they belong to. Classification is also called supervised learning because it is directed by these labelled objects. Formally, a classifier is a function of the following form:

Definition 2.21 (Classifier)

Let $O$ be the set of objects, $C$ the set of classes, and let $G: O \rightarrow C$ be the true mapping of all objects $o$ to their correct class $c_o$. Furthermore, let $T \subset O \times C$ with $T = \{(o, c) \mid o \in T_o \subset O \wedge G(o) = c\}$ be the set of already labelled training objects. Then a classifier is a function $CL_T: O \rightarrow C$ that maps the objects of $O$ to a class $c_i \in C$.

$G$ is also called the ground truth. The goal of classification is to train $CL_T$ in such a way that it reproduces the ground truth $G$ as well as possible.

One of the most important aspects of evaluating classifiers is that the quality of prediction for the objects $o_i \in T_o$ is not indicative of the performance observed for the objects $o_j \in O \setminus T_o$. Since the correct class of the objects in $T_o$ is already known, it is easy to find a classifier that achieves maximum performance on the training data. However, for reliable class predictions on unknown data objects $o \in O \setminus T_o$, a classifier has to generalize well. The key to building a good classifier is to find out which characteristics are significant for a class and which are typical only of individual data objects. If the classifier is based on too many individual characteristics, it will fit too closely to the elements of $T_o$, and the performance for new data objects $o \in O \setminus T_o$ degrades. This effect is known as overfitting and is one of the central problems of classification. A good theoretical description of this problem is found in the introduction of [Bur98].
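The effect can be sketched with a deliberately extreme classifier that simply memorizes the labelled objects. This is a minimal illustration and not a method from the text: the integer objects, the parity ground truth, and the names `ground_truth`, `train_memorizer` and `accuracy` are all hypothetical.

```python
from collections import Counter


def ground_truth(o):
    """Hypothetical ground truth G: an integer object is 'even' or 'odd'."""
    return "even" if o % 2 == 0 else "odd"


def train_memorizer(training_set):
    """Return a classifier that stores every labelled pair (o, c) verbatim.

    Unseen objects fall back to the most frequent training class,
    because the lookup table has captured no general rule.
    """
    table = dict(training_set)
    default = Counter(table.values()).most_common(1)[0][0]
    return lambda o: table.get(o, default)


def accuracy(classifier, objects):
    """Fraction of objects whose predicted class equals the ground truth."""
    return sum(classifier(o) == ground_truth(o) for o in objects) / len(objects)


labelled = [0, 2, 3, 4, 6, 8]    # objects with known class (T_o)
unseen = [1, 5, 7, 9, 10, 11]    # objects outside T_o
classifier = train_memorizer([(o, ground_truth(o)) for o in labelled])

train_acc = accuracy(classifier, labelled)
unseen_acc = accuracy(classifier, unseen)
print(train_acc, unseen_acc)
```

The memorizer is perfect on the objects it has stored but performs far worse on the unseen ones, because it has learned individual objects rather than the characteristic that defines each class.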

To measure classification performance without overfitting, the set of objects $T_o$ for which the correct class is already known is split into a training set $TR$ and a test set $TE$. The training set is used to train the classifier. Afterwards, the elements of the test set are classified and the following measures of classification performance can be calculated:

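• Classification Accuracy

$$\mathrm{Acc}(CL_{TR}) = \frac{|\{o \mid G(o) = CL_{TR}(o) \wedge o \in TE\}|}{|TE|}$$

• Precision

$$\mathrm{Precision}(CL_{TR}, c) = \frac{|\{o \mid G(o) = CL_{TR}(o) = c \wedge o \in TE\}|}{|\{o \mid CL_{TR}(o) = c \wedge o \in TE\}|}$$

• Recall

$$\mathrm{Recall}(CL_{TR}, c) = \frac{|\{o \mid G(o) = CL_{TR}(o) = c \wedge o \in TE\}|}{|\{o \mid G(o) = c \wedge o \in TE\}|}$$

• F-Measure

$$\mathrm{F\text{-}Measure}(CL_{TR}, c) = \frac{2 \cdot \mathrm{Precision}(CL_{TR}, c) \cdot \mathrm{Recall}(CL_{TR}, c)}{\mathrm{Precision}(CL_{TR}, c) + \mathrm{Recall}(CL_{TR}, c)}$$

Here $CL_{TR}$ denotes the classifier trained on the training set $TR$.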
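As a minimal sketch, the four measures can be computed from two parallel lists holding the true and the predicted class of every test object. The function names and the toy lists are hypothetical. Returning 0.0 when a denominator is empty is a convention assumed here, not something the text prescribes.

```python
def accuracy(truth, predicted):
    """Share of test objects whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def precision(truth, predicted, c):
    """Correct predictions of class c divided by all predictions of class c."""
    predicted_c = sum(p == c for p in predicted)
    correct_c = sum(t == p == c for t, p in zip(truth, predicted))
    return correct_c / predicted_c if predicted_c else 0.0  # assumed convention


def recall(truth, predicted, c):
    """Correct predictions of class c divided by all objects truly in class c."""
    actual_c = sum(t == c for t in truth)
    correct_c = sum(t == p == c for t, p in zip(truth, predicted))
    return correct_c / actual_c if actual_c else 0.0  # assumed convention


def f_measure(truth, predicted, c):
    """Harmonic mean of precision and recall for class c."""
    p, r = precision(truth, predicted, c), recall(truth, predicted, c)
    return 2 * p * r / (p + r) if p + r else 0.0  # assumed convention


# Hypothetical test set of six objects with three classes.
truth = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "C", "C"]

print(accuracy(truth, predicted))
print(precision(truth, predicted, "A"), recall(truth, predicted, "A"))
print(f_measure(truth, predicted, "A"))
```

For class A in this toy example, every prediction of A is correct, but one of the three true A objects is missed, so precision is higher than recall and the F-measure lies between the two.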

The classification accuracy is a performance measure that considers all classes. It is the percentage of correct predictions for the test set $TE$. However, if the number of test objects per class varies strongly, the accuracy tends to be misleading. Consider a test set for two classes $A$ and $B$ in which 95% of the objects belong to class $A$ and only 5% belong to class $B$. By always predicting class $A$, it is possible to achieve 95% classification accuracy on this test set without having a reasonable classifier. This example illustrates that relying on the accuracy is only advisable if the number of test objects in $TE$ is approximately the same for each class. In addition to the accuracy, the most important measures of classification performance are precision and recall. The precision for a class $c$ indicates the percentage of correctly classified objects among the objects that were predicted to belong to class $c$. The recall for class $c$ is the percentage of correctly classified objects among all objects that really belong to class $c$. Naturally, there is a trade-off between precision and recall. Most classifiers can be adjusted to increase the precision for a class $c$ while decreasing its recall, or the other way around. To have a measure that considers both aspects, the F-measure was introduced. The F-measure is the harmonic mean of precision and recall; for a given sum of the two, it is largest when both have the same value.
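The two-class example can be worked through numerically. The following sketch assumes a test set of exactly 100 objects (95 in class $A$, 5 in class $B$), a size chosen only for illustration, and a classifier that always predicts $A$:

```latex
\begin{align*}
\mathrm{Acc} &= \frac{95}{100} = 0.95 \\
\mathrm{Precision}(A) &= \frac{95}{100} = 0.95 \\
\mathrm{Recall}(A) &= \frac{95}{95} = 1 \\
\mathrm{F\text{-}Measure}(A) &= \frac{2 \cdot 0.95 \cdot 1}{0.95 + 1} \approx 0.974 \\
\mathrm{Recall}(B) &= \frac{0}{5} = 0
\end{align*}
```

The precision for class $B$ is undefined in this case, since no object is ever predicted to belong to $B$. The recall of 0 for class $B$ exposes the weakness that the 95% accuracy hides.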

Another problem of testing classifiers is that the set of already labelled instances $T$ usually tends to be very limited. Thus, training the classifier with only a subset of $T$ will decrease the classification performance. On the other hand, it is not possible to measure the classification performance correctly without already labelled data that are not part of the training set. To limit this problem, the technique of stratified $k$-fold cross validation was introduced. First of all, stratified $k$-fold cross validation divides $T$ into $k$ stratified folds. Each stratified fold contains approximately the same percentage of objects from each of the classes as is found in the complete set $T$. Now the classifier is trained $k$ times with $k-1$ folds, and each time another fold is left out for testing. Thus, each run produces a classification result for the fold that was left out, and together these results can be used to calculate the introduced performance measures for the complete data set $T$. Figure 2.10 illustrates the building of stratified folds.

[Figure 2.10: Illustration for 3-fold cross validation. The complete training set TR is divided into three folds; in each run, two folds form the training set (TR1, TR2, TR3) and the remaining fold forms the test set (TE1, TE2, TE3). The classification results of the three runs are combined into a global classification result.]
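The procedure can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the round-robin assignment is just one simple way to obtain approximately stratified folds, and `stratified_folds`, `cross_validate`, `train_majority` and the toy data are hypothetical names and values, not taken from the text.

```python
from collections import Counter, defaultdict


def stratified_folds(labelled, k):
    """Split (object, class) pairs into k folds with similar class proportions.

    The pairs are grouped by class and dealt round-robin to the folds,
    so every fold receives about 1/k of the objects of each class.
    """
    by_class = defaultdict(list)
    for obj, cls in labelled:
        by_class[cls].append((obj, cls))
    folds = [[] for _ in range(k)]
    position = 0
    for cls in sorted(by_class):
        for pair in by_class[cls]:
            folds[position % k].append(pair)
            position += 1
    return folds


def cross_validate(labelled, k, train):
    """Train k times on k-1 folds and classify the fold left out each time.

    `train` takes a list of (object, class) pairs and returns a classifier.
    The result holds one (object, true class, predicted class) triple
    for every labelled object.
    """
    folds = stratified_folds(labelled, k)
    results = []
    for i, test_fold in enumerate(folds):
        training_set = [pair for j, fold in enumerate(folds) if j != i
                        for pair in fold]
        classifier = train(training_set)
        results.extend((obj, cls, classifier(obj)) for obj, cls in test_fold)
    return results


def train_majority(training_set):
    """Toy trainer: always predict the most frequent training class."""
    majority = Counter(cls for _, cls in training_set).most_common(1)[0][0]
    return lambda obj: majority


# Hypothetical labelled set: six objects of class A, three of class B.
labelled = [(i, "A") for i in range(6)] + [(i, "B") for i in range(6, 9)]
folds = stratified_folds(labelled, 3)
results = cross_validate(labelled, 3, train_majority)
cv_accuracy = sum(true == pred for _, true, pred in results) / len(results)
print(cv_accuracy)
```

Because every labelled object is left out exactly once, the combined results cover the complete labelled set, although each individual classifier was trained on only k-1 folds.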

In addition to the quality of prediction, there are other important aspects of a classifier. Especially for database applications, efficiency is an important aspect. The efficiency of a classifier is measured by the time it takes to classify a new unlabelled object. This so-called classification time is important, since a classifier is expected to be applied to large numbers of objects without being modified. The time that is spent on training a classifier, the so-called training time, is considered less important in most cases, because the training set is assumed to contain only a small number of objects. However, if the training of a classifier is very time-consuming, the complete KDD process is slowed down. Therefore, the time to train a classifier has to be considered for several applications as well. The last important aspect of classification is the understandability of the class model that is found. Though accurate classification is the primary goal, many applications need explicit knowledge about the characteristics of the treated classes. Thus, providing class models that are easily understood by human users is an important feature of a classifier. Unfortunately, many of the established methods do not provide this feature.
