From the large number of available machine learning algorithms, three main directions can be identified: classification, clustering, and association. All three will be briefly introduced here. Afterwards, those algorithms that have either been used in similar tasks (see Chp. 4.2, Chp. 5.4.2, and Chp. 6.3.1) or seem to fit the requirements for the task of this thesis will be explained in more detail.
47 The goal is to avoid checking every possible combination of vertices, by restricting the possible pairs
Machine learning algorithms, be they clustering, classification, or association algorithms, work on instances. These instances represent events or entities that are thought of as manifesting some underlying patterns. An instance consists of a number of (meaningful) features, a so-called feature vector, that can be used to distinguish, relate, or compare instances.
Machine learning algorithms can be either supervised or unsupervised. Supervised approaches are based on a labeled data set that has been (manually) classified or measured in some way. The set of instances is split into two subsets: one used to train the algorithm and one used for testing, that is, for evaluating the learned model on previously unseen data. A special case of validation is x-fold cross-validation, where the whole set is split into x subsets of the same size; in each iteration, x − 1 subsets (i.e., $\frac{x-1}{x}$ of the whole set) are used for training and the remaining subset ($\frac{1}{x}$ of the whole set) is used for testing/evaluating.
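The splitting scheme of x-fold cross-validation can be illustrated with a minimal Python sketch. The function name `x_fold_splits` and the round-robin assignment of instances to folds are assumptions made for illustration only; in practice the instances would typically be shuffled before splitting, which is omitted here.

```python
def x_fold_splits(instances, x):
    """Yield (train, test) pairs for x-fold cross-validation.

    Hypothetical helper for illustration: instances are assigned to
    folds round-robin, so fold sizes differ by at most one.
    """
    if not 2 <= x <= len(instances):
        raise ValueError("x must be between 2 and the number of instances")
    folds = [instances[i::x] for i in range(x)]
    for i, test in enumerate(folds):
        # All folds except the i-th one form the training set.
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        yield train, test


data = list(range(10))
splits = list(x_fold_splits(data, 5))
for train, test in splits:
    print(len(train), len(test))
```

With 10 instances and x = 5, every iteration trains on 4/5 of the data and tests on the remaining 1/5, and each instance is used for testing exactly once.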
Unsupervised approaches are not based on a training set, but are meant to find meaningful distinctions or similarities between the instances of the data themselves, without a manual annotation.
Classification tasks are supervised and work on a predefined set of already classified instances. Often the class is binary (e.g., classifying emails as either spam or not-spam), but there can also be multiple classes; in number recognition, for example, each element has to be assigned one of the digits from 0 to 9. Each instance has one class attribute. The goal of classification is to learn or identify non-obvious rules of attribute combinations that predict the class of each instance. The necessary rules can be obtained using different algorithms, from decision trees to support vector machines. New instances, that is, data not previously seen in the training set, can then be classified using these rules.
Clustering algorithms are unsupervised and used to form (meaningful) subsets of instances of a data set. These clusters can be previously unknown. The task is to find patterns that make it very likely that different instances are somehow connected. Two different kinds of clustering algorithms are of relevance here: The first group is used to cluster networks; that is, it is used to identify densely connected communities within a graph. The second kind works on feature sets and can be thought of as clustering similar vectors. If one thinks of instances as vectors in a vector space, one can also calculate the distance between two vectors, using the Euclidean distance or other measures such as the cosine similarity. Vectors with a small distance between them form a cluster. In the end, which algorithms can be used is a question of data representation, either as a network or as a set of vectors. Most data can be represented in both fashions.
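The two measures mentioned for comparing feature vectors can be sketched in a few lines of Python. The function names are chosen for illustration, and the vectors are assumed to be plain sequences of numbers of equal length. Note that the Euclidean distance is small for similar vectors, whereas the cosine similarity is large (close to 1) for vectors pointing in a similar direction.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(euclidean_distance([0, 0], [3, 4]))
print(cosine_similarity([1, 2], [2, 4]))
```

The second call compares two vectors that differ only in length, not in direction, which is why the cosine similarity treats them as maximally similar although their Euclidean distance is not zero.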
Association techniques are unsupervised and most often used in marketing and sales contexts. To my knowledge, association rules were first described by Agrawal et al. (1993) and applied to sales data to show “an association between departments in the customer purchasing behavior” (Agrawal et al., 1993, p. 215). Association rules do not classify instances; they are, in a way, similar to clustering approaches though: They do not look at the properties of single instances themselves, but they look for patterns of their co-occurrence. Looking at marketing contexts, the idea is to find any kind of association rule between items that are, for example, often purchased together or by the same customer. This information can be used to build a recommendation system, for example.
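The idea of looking for co-occurrence patterns can be illustrated by counting how often pairs of items appear in the same purchase. The following Python sketch shows only this counting step; it is not the algorithm of Agrawal et al. (1993), and the function name `pair_counts` as well as the example baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how often each unordered pair of items occurs together."""
    counts = Counter()
    for items in transactions:
        # Sort the distinct items so each pair has one canonical order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


# Invented example data: each inner list is one purchase.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
counts = pair_counts(baskets)
print(counts.most_common(1))
```

Pairs with a high count are candidates for an association such as "customers who buy bread often also buy butter", which a recommendation system could build on.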
The most important measures of quality in evaluating machine learning models are the precision, the recall, and the accuracy. These measures are taken from information retrieval (IR) and can be applied most successfully to classification tasks. They are computed using the fractions of correctly identified, or classified, instances versus those that were not found.
Table 7: Evaluation in IR: true positives, false positives, true negatives, and false negatives.

                        Assigned class
                        false    true
Actual class   false    tn       fp
               true     fn       tp
The documents the system regards as relevant are called hits. In IR, the precision is the number of relevant hits divided by the total number of hits. As can be seen in Table 7, the relevant hits are called true positives (tp); the irrelevant hits are called false positives (fp). The documents the system correctly identifies as irrelevant are called true negatives (tn). Those documents that the system incorrectly regards as irrelevant are called false negatives (fn).
For classification, precision is accordingly the number of correctly classified instances (tp) divided by the sum of correctly and incorrectly classified instances (tp + fp):
\[ \text{precision} = \frac{tp}{tp + fp}. \tag{10} \]
Recall in IR refers to the coverage: the number of correctly identified hits divided by the number of relevant documents (i.e., those that are correctly classified (tp) and those that are incorrectly regarded as irrelevant (fn)).
\[ \text{recall} = \frac{tp}{tp + fn}. \tag{11} \]
Most systems tend to have either a high precision but a low coverage (recall) or tend to have a high coverage with a low precision. To estimate how well the system is balanced, the so-called F measure is used:
\[ F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{12} \]
The accuracy is the number of documents that are correctly classified (tp and tn) divided by the total number of documents (i.e., the sum of tp, tn, fp, and fn):
\[ \text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}. \tag{13} \]
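The four measures can be computed directly from the counts tp, fp, tn, and fn. The following Python sketch assumes a binary task in which the actual and assigned classes are given as lists of booleans; the function names and the example data are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not one prescribed by the definitions above.

```python
def confusion_counts(actual, assigned):
    """Return (tp, fp, tn, fn) for two equal-length lists of booleans."""
    pairs = list(zip(actual, assigned))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fn = sum(1 for a, p in pairs if a and not p)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    # Assumed convention: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def evaluate(actual, assigned):
    tp, fp, tn, fn = confusion_counts(actual, assigned)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f_measure = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy


# Invented example: 8 documents, 4 of which are actually relevant.
actual = [True, True, True, False, False, False, False, True]
assigned = [True, True, False, True, False, False, False, False]
precision, recall, f_measure, accuracy = evaluate(actual, assigned)
print(precision, recall, f_measure, accuracy)
```

In this example the system returns three hits, two of which are relevant (tp = 2, fp = 1), and misses two relevant documents (fn = 2), so the precision is 2/3, the recall is 1/2, and the accuracy is 5/8.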
In IR as well as in classification tasks, the goal is to find a system or model that provides both high precision and high recall. For a multiclass classification task, the calculation of precision, recall, F measure, and accuracy has to be done for each class separately. The results can then be averaged.
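The per-class calculation for a multiclass task can be sketched as follows, restricted to precision and recall for brevity. Each class is treated in turn as the "true" class and all other classes as "false". The function names and example labels are invented, and the unweighted mean over classes (often called macro-averaging) is only one possible way of averaging; it is an assumption here, since the averaging scheme is not specified above.

```python
def per_class_scores(actual, assigned):
    """Return {label: (precision, recall)} using a one-vs-rest view."""
    pairs = list(zip(actual, assigned))
    scores = {}
    for label in sorted(set(actual) | set(assigned)):
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        # Assumed convention: an undefined ratio is reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


def macro_average(scores):
    """Unweighted mean of precision and recall over all classes."""
    n = len(scores)
    mean_precision = sum(p for p, _ in scores.values()) / n
    mean_recall = sum(r for _, r in scores.values()) / n
    return mean_precision, mean_recall


# Invented example with three classes.
actual = ["a", "a", "b", "b", "c", "c"]
assigned = ["a", "b", "b", "b", "c", "a"]
scores = per_class_scores(actual, assigned)
avg_precision, avg_recall = macro_average(scores)
print(scores)
print(avg_precision, avg_recall)
```

An unweighted mean gives every class the same influence regardless of how many instances it has; weighting each class by its number of instances would be an alternative design choice.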