This section explains the most significant terminology used later in this thesis. It begins with the terminology for the proposed data distribution techniques, which distribute class labels between nodes within the binary tree hierarchical classification model, and then covers more general terminology. Three different techniques were considered:
1. K-means clustering. This technique uses the well-known k-means clustering algorithm to cluster the data into two clusters (k = 2) without eliminating any resulting overlap between the classes. This contrasts with the mechanism adopted in BCT-OB, discussed earlier in Section 2.4, where a cleaning mechanism was adopted to reduce the number of overlapping classes between the two resulting clusters. The conjectured advantage is that the presence of overlap could mitigate the early mis-classification issue associated with the hierarchical structure.
2. Divisive hierarchical clustering. This technique adopts a top-down “divisive” hierarchical clustering algorithm to cluster the data, whereby clusters are repeatedly divided into sub-clusters until some stopping criterion is reached. Again, as in the case of k-means clustering, overlap between classes is maintained and no cleaning mechanism is adopted to eliminate it.
[Footnote 9] Demsar describes this as follows: “Sometimes the Friedman test reports a significant difference but the post-hoc test fails to detect it. This is due to the lower power of the latter. No other conclusion than that some algorithms do differ can be drawn in this case” [25].
3. Data splitting. This technique comprises a simple “cut” of the data into two groups so that each contains a disjoint subset of the entire set of class labels. More specifically, the data splitting technique splits the data into two disjoint groups without overlap between class labels and without taking into account any similarity between them. For example, given a data set with six class labels {a, b, c, d, e, f}, data splitting might split these into {a, b, c} and {d, e, f}, or any other possible balanced split.
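The data splitting technique can be illustrated with a minimal Python sketch. The function names `split_labels` and `split_examples` are hypothetical, and cutting the sorted labels at the midpoint is an assumption made here for determinism; the text only requires that the two groups be disjoint and balanced.

```python
def split_labels(labels):
    """Cut a collection of class labels into two disjoint, balanced groups.

    Assumption: the labels are sorted and cut at the midpoint. Any other
    balanced, disjoint cut would satisfy the description equally well.
    """
    ordered = sorted(set(labels))
    mid = (len(ordered) + 1) // 2  # first group takes the extra label if odd
    return ordered[:mid], ordered[mid:]


def split_examples(examples):
    """Partition (features, label) pairs according to the label split."""
    first_labels, second_labels = split_labels(label for _, label in examples)
    first = [ex for ex in examples if ex[1] in first_labels]
    second = [ex for ex in examples if ex[1] in second_labels]
    return first, second
```

Under this assumption, `split_labels("abcdef")` yields the groups {a, b, c} and {d, e, f} used in the example above. No similarity between classes is consulted at any point.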
In addition, the following terminology is used throughout the thesis:
1. Bagging. This term refers to the process of dividing a given dataset into disjoint partitions and learning a classifier for each partition. In other words, it refers to one of the “Bagging-Like-Strategies” explained earlier in Section 2.4. Efficiency is the main reason behind the selection of this type of sampling. In the work presented in this thesis three partitions were used. Three partitions were adopted so as to avoid the insufficient information issue associated with sampling by partitioning, which worsens as the number of partitions increases.
2. Generation time. This term refers to the time required to train a classifier using a given training set comprised of a set of examples.
3. Classification time. This term refers to the time required to classify the examples in a given test set.
4. Directed Acyclic Graph (DAG). This term refers to a graph that comprises a set of nodes (vertices) and directed edges (arcs), where every node has at least one inward or outward edge connecting it to another node in such a way that there are no cycles; in other words, there is no sequence of edges starting from a node N that eventually loops back to N [21, 95].
5. Rooted DAG. This term refers to a DAG that has exactly one node designated as a root node, that is, a node that has no edges pointing into it; in other words, there is only one node that has no predecessor nodes [81].
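The DAG and rooted DAG definitions above can be checked mechanically. The following Python sketch is illustrative only: the function names are hypothetical, and the acyclicity test uses Kahn's topological-sort algorithm, a standard technique that is not taken from the thesis.

```python
from collections import deque


def is_dag(nodes, edges):
    """Return True if the directed graph has no cycles (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    # Repeatedly remove nodes that have no remaining inward edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    # Nodes on a cycle are never removed.
    return removed == len(nodes)


def roots(nodes, edges):
    """Nodes that have no predecessor (no edge pointing into them)."""
    targets = {dst for _, dst in edges}
    return [n for n in nodes if n not in targets]


def is_rooted_dag(nodes, edges):
    """A DAG with exactly one node that has no predecessors."""
    return is_dag(nodes, edges) and len(roots(nodes, edges)) == 1
```

For instance, the diamond-shaped graph with edges r→x, r→y, x→z and y→z is a rooted DAG with root r, whereas adding the edge z→r introduces a cycle.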
2.9 Summary
In summary, this chapter has presented a literature review of the most widely used approaches to solve the multi-class classification problem, namely:
1. Stand-alone classification.
2. Collections of binary classifiers. From the literature, the most significant work is OVO.
3. Ensemble classifiers arranged in: (i) concurrent, (ii) sequential, (iii) binary tree hierarchical form and (iv) DAG hierarchical form.
In the context of binary hierarchical ensemble classification, previous work can be divided into two main categories with respect to the technique adopted to distribute class labels between nodes within the binary tree: (i) overlap between classes at nodes and (ii) no overlap between classes at nodes. The work presented in this thesis considers both categories. Previous work on DAG hierarchical ensemble classification has focused on using binary classifiers at nodes rather than groups of classes as proposed in this thesis. As noted earlier in this chapter, this can be considered to be a special case of using a set of binary classifiers to solve the multi-class classification problem. It can also be considered as a way to combine the results from OVO decomposition.
In addition, an overview of clustering algorithms has been presented. The reason for this is that clustering algorithms are used to distribute classes between nodes within the hierarchy in the work presented later in this thesis. A general overview of the statistical tests used later in this thesis to compare the different classification models was also provided. In the following chapter (Chapter 3) the hierarchical ensemble classification model for multi-class classification based on the use of a Binary Tree (BT) structure is presented.
The Binary Tree Hierarchical Classification Model
3.1 Introduction
The nature of the proposed Binary Tree hierarchical classification approach is presented in this chapter. As noted earlier in the introduction to this thesis, the binary tree hierarchical classifier is a form of ensemble classifier. Each node in the hierarchy holds a classifier. Classifiers at the leaves conduct fine-grained (binary) classifications, while the classifiers at non-leaf nodes further up the hierarchy conduct coarse-grained classification directed at categorising examples using groups of labels. To remind the reader of the Binary Tree hierarchy, Figure 2.1 from Chapter 2 is given again in Figure 3.1. At the root we classify into two groups of class labels, {a, b, c, d} and {e, f, g}. At the next level we split into smaller groups, and so on until we reach classifiers that can associate single class labels with examples. Note that Figure 2.1 is just an example of the proposed hierarchical model; non-leaf child nodes may end up with overlapping classifications because the adopted clustering algorithms may assign examples belonging to the same class to different clusters. Recall from Chapter 2 that there has been some previous work on binary tree based hierarchical ensemble classification (Section 2.4).
The challenges of hierarchical single-label classification, as conceived in this thesis, are: (i) how best to distribute (organise) the class labels between nodes so as to produce a Binary Tree classifier that generates the most effective classifications and (ii) how to address the successive mis-classification issue imposed by the hierarchical structure. To address the first issue, this chapter reports on several techniques considered for organising (grouping) the class labels so as to produce a hierarchy that generates an effective classification. These were founded on the use of clustering and splitting techniques to distribute the class labels, as noted in Section 2.4. With respect to the second issue, a Multiple Path strategy is proposed (facilitated by the probability and confidence values generated by the Naive Bayes and CARM classifiers, respectively, hosted at the Binary Tree nodes). The first issue is discussed further in Section 3.2, where the generation of the Binary Tree ensemble approach is presented in detail, while the second issue is addressed in Section 3.3, where the operation of the proposed approach is presented. Section 3.4 presents an overview of the conducted experiments and the obtained results. Finally, a summary of this chapter is presented in Section 3.5.

[Figure 3.1 diagram: the root classifier separates {a, b, c, d} from {e, f, g}; its children separate {a, b} from {c, d} and {e} from {f, g}; the leaf classifiers separate {a} from {b}, {c} from {d} and {f} from {g}.]
Figure 3.1: Binary Tree Hierarchy example.
3.2 Binary Tree Hierarchical Model Generation
In this section the generation of the proposed Binary Tree hierarchical classification model is explained in detail. Recall that in the proposed model classifiers nearer the root of the hierarchy conduct coarse-grain classification with respect to subsets of the available set of classes. Classifiers at the leaves of the hierarchy conduct fine-grain (binary) classification. To create the hierarchy a classifier needs to be generated for each node in the hierarchy using an appropriately configured training set.
Two classification styles were considered with respect to the nodes in the proposed binary tree ensemble hierarchy: (i) straightforward single “stand-alone” classifiers (Figure 3.2(a)) and (ii) Bagging ensembles (Figure 3.2(b)). With respect to the first style, a simple classifier (of any form: Decision Tree, Naive Bayes or CARM) was generated for each node in the hierarchy. With respect to Bagging, the data set D associated with each node was randomly divided into N disjoint partitions and a classifier generated for each (in the evaluation reported in Section 3.4, N = 3 was used).
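The partition-based Bagging style can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function names, the `seed` parameter and the `train` callable (standing in for any of the classifier generators) are hypothetical. Note that, as defined in this thesis, Bagging divides the data into disjoint partitions rather than drawing bootstrap samples with replacement.

```python
import random


def disjoint_partitions(examples, n_partitions=3, seed=0):
    """Randomly divide the examples into n_partitions disjoint partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    # Deal the shuffled examples round-robin; sizes differ by at most one.
    return [shuffled[i::n_partitions] for i in range(n_partitions)]


def train_bagging_node(examples, train, n_partitions=3, seed=0):
    """Train one classifier per partition; `train` is a hypothetical callable."""
    parts = disjoint_partitions(examples, n_partitions, seed)
    return [train(part) for part in parts]
```

With N = 3, each classifier at a node sees roughly one third of the data associated with that node, which is where the efficiency gain comes from and also why a larger N risks leaving each classifier with insufficient information.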
Figure 3.2: Binary Tree hierarchical classification model, (a) using a single classifier at each node and (b) using a Bagging ensemble at each node.
In order to group (divide) the input data D during the hierarchy generation process, three different distribution techniques were considered: (i) k-means clustering, (ii) divisive hierarchical clustering and (iii) data splitting. K-means and divisive hierarchical clustering were both described in Section 2.6 in Chapter 2. Of these, k-means is the most commonly used partitioning method, where examples are divided into k partitions (in our model k = 2 was used because of the binary nature of our hierarchies). Hierarchical clustering creates a hierarchical decomposition of the given data. In the context of the work described in this thesis a “divisive” (top-down) hierarchical clustering was used because it fits well with the vision of hierarchical ensemble classification presented in this thesis. Recall from Chapter 2 that the process commences with all examples in one cluster; on each successive iteration a cluster is split into smaller clusters until a “best” cluster configuration is arrived at (measured using cluster cohesion and separation measures). The idea behind the use of clustering algorithms is, at each level and branch of the hierarchy, to group the available class labels into two disjoint groups (clusters) so that the classes within each group share some similar characteristics. Data splitting comprises a simple “cut” of the data into two groups so that each contains a disjoint subset of the entire set of class labels. More specifically, the data splitting technique splits the data into two disjoint groups without overlap between class labels and without taking into account any similarity between them. For example, given a data set with six class labels {a, b, c, d, e, f}, data splitting might split these into {a, b, c} and {d, e, f}, or any other possible balanced split.
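The way clustering with k = 2 can leave a class label in both groups is illustrated by the following Python sketch. It is a plain k-means written for illustration only; the function names are hypothetical, and initialising the centroids from the first and last points is an assumption made to keep the example deterministic.

```python
def two_means(points, iterations=20):
    """Plain k-means with k = 2; returns a cluster index (0 or 1) per point."""
    # Assumption: deterministic initialisation from the first and last point.
    centroids = [points[0], points[-1]]
    assignment = []
    for _ in range(iterations):
        # Assign each point to the nearest centroid (squared Euclidean distance).
        assignment = [
            min((0, 1), key=lambda c: sum((x - m) ** 2
                                          for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment


def label_groups(points, labels):
    """Class labels present in each cluster, plus the labels found in both."""
    groups = [set(), set()]
    for label, cluster in zip(labels, two_means(points)):
        groups[cluster].add(label)
    return groups[0], groups[1], groups[0] & groups[1]
```

As a small usage example, with one-dimensional points 0.0 and 0.2 (class a), 0.4 and 5.0 (class b), and 5.2 and 5.1 (class c), the two clusters hold the label groups {a, b} and {b, c}, so class b overlaps. No cleaning step is applied to remove it.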
The proposed Binary Tree hierarchy generation algorithm is presented in Algorithm 1. The algorithm assumes a data structure, called hierarchy, comprised of the following fields:
1. Classifier: A classifier at each tree node.
2. Left: Reference to left branch of the hierarchy (root and body nodes only, set to null at leaves).
3. Right: Reference to right branch of the hierarchy (root and body nodes only, set to null at leaves).
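The three fields listed above map naturally onto a small recursive record. The following Python sketch is one possible rendering; the class name `Hierarchy` follows the text, while the helper methods `is_leaf` and `count_classifiers` are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Hierarchy:
    classifier: Any                        # classifier held at this node
    left: Optional["Hierarchy"] = None     # left branch, None at leaves
    right: Optional["Hierarchy"] = None    # right branch, None at leaves

    def is_leaf(self) -> bool:
        """A leaf has neither a left nor a right branch."""
        return self.left is None and self.right is None

    def count_classifiers(self) -> int:
        """Number of classifiers in the sub-hierarchy rooted at this node."""
        total = 1
        for child in (self.left, self.right):
            if child is not None:
                total += child.count_classifiers()
        return total
```

A branch reference may also be empty at a non-leaf node when one of the two groups contains a single class label, as with the group (e) in Figure 3.1, since no further classifier is needed on that side.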
The algorithm presented in Algorithm 1 is now considered in further detail. The Generate Hierarchy procedure is recursive. On each recursion the input to the Generate Hierarchy procedure is the data set D (initially this is the entire training set). If the number of classes featured in D is two, a binary classifier is constructed to distinguish between the two classes (lines 15-17). The more involved part of the Generate Hierarchy procedure is where the number of classes featured in D is more than two. In this case the examples in D are divided into two groups, D1 and D2, each with an associated meta-class label, K1 and K2 respectively (line 20). A classifier is then constructed to discriminate between K1 and K2 (line 21). The Generate Hierarchy procedure is then called again, once with D1 (representing the right branch of the hierarchy) if the number of classes featured in D1 is more than one, and once with D2 (representing the left branch of the hierarchy) if the number of classes featured in D2 is more than one (lines 26 and 32).
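The recursion described above can be sketched in Python as follows. This is a hedged illustration of the described steps, not a transcription of Algorithm 1: the names `Node`, `classes_in`, `data_split` and `generate_hierarchy` are hypothetical, `train` stands in for any classifier generator, and a balanced data split is assumed as the default division technique.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    classifier: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def classes_in(data):
    """Sorted class labels featured in a list of (features, label) pairs."""
    return sorted({label for _, label in data})


def data_split(data):
    """Assumed default division: a balanced cut of the sorted class labels."""
    labels = classes_in(data)
    first = set(labels[:(len(labels) + 1) // 2])
    d1 = [ex for ex in data if ex[1] in first]
    d2 = [ex for ex in data if ex[1] not in first]
    return d1, d2


def generate_hierarchy(data, train, divide=data_split):
    """Recursively build the hierarchy; `train` is a hypothetical callable."""
    labels = classes_in(data)
    if len(labels) < 2:
        raise ValueError("at least two classes are required")
    if len(labels) == 2:
        return Node(train(data))  # binary classifier, no further branches
    d1, d2 = divide(data)
    # Relabel the two groups with the meta-class labels K1 and K2.
    meta = [(x, "K1") for x, _ in d1] + [(x, "K2") for x, _ in d2]
    node = Node(train(meta))
    if len(classes_in(d1)) > 1:
        node.right = generate_hierarchy(d1, train, divide)  # D1: right branch
    if len(classes_in(d2)) > 1:
        node.left = generate_hierarchy(d2, train, divide)   # D2: left branch
    return node


def count_classifiers(node):
    """Number of classifiers in the hierarchy rooted at `node`."""
    if node is None:
        return 0
    return 1 + count_classifiers(node.left) + count_classifiers(node.right)
```

A clustering-based `divide` function could be substituted for `data_split`; in that case the two groups may share class labels, and the sketch would need a guard against a division that fails to reduce the number of classes in either group.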
It is interesting to note that if a clustering algorithm (k-means or divisive hierarchical clustering) is used to divide class labels between nodes during the generation process, the number of classifiers that will be generated cannot be calculated in advance, since overlapping class labels can appear in both branches. In contrast, if the data splitting technique is used during the hierarchy generation process, the number of classifiers that need to be trained is N − 1, where N is the number of class labels in the given dataset.
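The N − 1 count for data splitting follows from the recursion: each division adds one classifier and a group with a single class label needs none, so a hierarchy over N disjointly split labels has one classifier fewer than it has labels. The following Python sketch, with the hypothetical function name `classifiers_needed` and an assumed balanced cut, checks this numerically.

```python
def classifiers_needed(n_labels):
    """Classifiers trained when n_labels labels are split disjointly."""
    if n_labels < 2:
        return 0  # a single class label needs no classifier
    first = (n_labels + 1) // 2   # assumed balanced cut
    second = n_labels - first
    # One classifier for this division plus those for the two branches.
    return 1 + classifiers_needed(first) + classifiers_needed(second)
```

For the seven class labels of Figure 3.1 this gives six classifiers, matching the six classifier nodes shown in the figure.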