In the previous sections, we investigated various types of user actions as interest indicators. We now apply machine learning methods to predict relevance from this user data. If these methods work well, their predictions can serve as input to standard relevance feedback methods, thus implementing implicit relevance feedback. For this purpose, we use RapidMiner and the R system [R Development Core Team, 2006] for automatic classification, where each instance belongs to one of the classes ‘relevant’ or ‘nonrelevant’.
Training and Testing
For classification, normally the data is divided into two sets, i.e. training and test. The classifier is trained on the training set. To predict the performance of a classifier, we need to assess its error rate on a dataset that played no part in the formation of the classifier. This independent sample is called the test data. The classifier predicts the class of each instance: if it is correct, that is counted as success; if not, it is an error.
A more general way to mitigate any bias caused by the particular sample chosen is to repeat the whole process, training and testing, several times.
10-fold Cross-Validation
In 10-fold cross-validation [Witten and Frank, 2005], the original sample is partitioned into 10 subsamples. A single subsample is retained as the validation data for testing the model, and the remaining 9 subsamples are used as training data. The cross-validation process is then repeated 10 times (the folds), with each of the 10 subsamples used exactly once as the validation data. The 10 results from the folds can then be averaged to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
If a dataset contains very few instances of one class, there is a chance that a given fold does not contain any instance of this class. To ensure that this does not happen, stratified 10-fold cross-validation is used, where each fold contains roughly the same proportion of class labels as the original set of samples.
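The stratification step can be sketched as follows. This is a minimal illustration in plain Python, not the procedure implemented in RapidMiner or R; the function names `stratified_folds` and `cross_validation_splits` and the toy class distribution are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Return k lists of instance indices with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the instances of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds

def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical toy data: 30 relevant and 70 nonrelevant instances.
labels = ["relevant"] * 30 + ["nonrelevant"] * 70
splits = list(cross_validation_splits(labels, k=10))
```

With this toy distribution, every test fold holds 10 instances, 3 of which are relevant, so the 30:70 class ratio of the full sample is preserved in each fold.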
Support Vector Machine (SVM)
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression [Witten and Frank, 2005]. Viewing the input data as two sets of vectors in an n-dimensional space, an SVM constructs a separating hyperplane in that space, one which maximises the margin between the two data sets. To calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighbouring data points of both classes, since in general the larger the margin, the lower the generalisation error of the classifier. In our experiments, we used SVMs with linear kernels.
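The notion of margin can be illustrated with a small sketch. The code below does not train an SVM; it only computes the geometric margin of a given hyperplane, so that two candidate separators can be compared. The function name `geometric_margin` and the four toy points are assumptions made for the example.

```python
import math

def geometric_margin(points, labels, w, b):
    """Smallest signed distance of any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1. A positive result means the hyperplane separates
    the two classes; among separating hyperplanes, an SVM prefers the one
    with the largest value.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: two classes separated along the first axis.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, +1, +1]

centred = geometric_margin(points, labels, w=(1.0, 0.0), b=-1.5)  # plane x = 1.5
shifted = geometric_margin(points, labels, w=(1.0, 0.0), b=-0.5)  # plane x = 0.5
```

Both hyperplanes separate the toy data, but the centred one leaves a margin of 1.5 on each side whereas the shifted one leaves only 0.5, so the centred one is the separator a maximum-margin method would choose.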
Decision Tree
In a decision tree, each inner node corresponds to one attribute and each branch stands for one possible value of this attribute (numeric values have to be discretized first). In the classification process, an instance walks through the tree by starting from the root and following the branches according to its attribute values. When a leaf node is reached, the class corresponding to this leaf is assigned [Witten and Frank, 2005].
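The classification walk can be sketched as follows. The tree below is hand-built and purely hypothetical: it is not a tree learned in our experiments, and the discretised values ("few", "many", "short", "long") are assumptions made for the example.

```python
# Inner nodes are (attribute, {value: subtree}) pairs; leaves are class labels.
tree = ("clicks", {
    "few": "nonrelevant",
    "many": ("reading_time", {
        "short": "nonrelevant",
        "long": "relevant",
    }),
})

def classify(node, instance):
    """Follow the branches matching the instance's attribute values to a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

prediction = classify(tree, {"clicks": "many", "reading_time": "long"})
```

Here the instance first follows the "many" branch at the root and then the "long" branch at the second node, where it reaches the leaf labelled relevant.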
Metrics
To measure the classification quality, we use accuracy. The contingency table (Table 9.1) shows the four possible combinations of classifier prediction and human judgement.
True positive (TP): An instance is correctly predicted as yes (positive). This is a correct classification.
False positive (FP): An instance is incorrectly predicted as yes (positive) when it is in fact no (negative).
False negative (FN): An instance is incorrectly predicted as no (negative) when it is in fact yes (positive).
True negative (TN): An instance is correctly predicted as no (negative). This is a correct classification.
Relevance                   Human judgement
                            Yes                     No
Classifier      Yes         TP (true positives)     FP (false positives)
judgement       No          FN (false negatives)    TN (true negatives)
Table 9.1: Contingency table for a class
The accuracy a is defined as the ratio of the number of correct classification assignments to the number of all classification assignments:
$$ a = \frac{TP + TN}{TP + TN + FP + FN} $$
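As a worked illustration of the accuracy definition, the following sketch counts the four contingency cells for a handful of predictions and derives the accuracy from them. The function names and the five toy predictions are assumptions made for the example.

```python
def contingency_counts(predicted, judged):
    """Count TP, FP, FN and TN for parallel lists of booleans (True = relevant)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for p, j in zip(predicted, judged):
        if p:
            counts["TP" if j else "FP"] += 1
        else:
            counts["FN" if j else "TN"] += 1
    return counts

def accuracy(counts):
    """Correct assignments (TP + TN) divided by all assignments."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Hypothetical toy data: classifier predictions and human judgements.
predicted = [True, True, False, False, True]
judged = [True, False, False, True, True]

counts = contingency_counts(predicted, judged)
a = accuracy(counts)
```

For these five instances the counts are TP = 2, FP = 1, FN = 1 and TN = 1, giving an accuracy of 3/5 = 0.6.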
Experimentation
Although we have multi-valued relevance scales, we want to predict only binary relevance in our classification experiments (since this is already hard enough, as we will see). For this purpose, we consider two different interpretations of relevance, which we call strict and loose. Furthermore, we investigate relevance predictions both at the level of single elements and at the document level. In the element-based approach, the strict interpretation regards only fully relevant items as relevant, whereas the loose interpretation regards as relevant everything that was not judged ‘not relevant’. In the document-based approach, the average of the relevance judgements per document is considered. The average relevance therefore ranges from 0 to 3 for iTrack 2005 and from 0 to 4 for iTrack 2006-07. The thresholds for strict and loose are defined as follows:
iTrack 2005: loosely relevant = relevance > 0.5; strictly relevant = relevance > 1
iTrack 2006-07: loosely relevant = relevance ≥ 1; strictly relevant = relevance ≥ 3.5
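The document-based labelling can be sketched as follows, using the thresholds stated above. The function name `document_label`, the dictionary layout, and the sample judgements are assumptions made for the example; the second key assumes that the second threshold line applies to the iTrack 2006-07 scale of 0 to 4.

```python
# Thresholds on the average relevance per document, as stated in the text.
THRESHOLDS = {
    "2005": {"loose": lambda r: r > 0.5, "strict": lambda r: r > 1},
    "2006-07": {"loose": lambda r: r >= 1, "strict": lambda r: r >= 3.5},
}

def document_label(judgements, track, interpretation):
    """Binary relevance of a document from its element-level judgements."""
    average = sum(judgements) / len(judgements)
    return THRESHOLDS[track][interpretation](average)

# Hypothetical judgements for the elements of one document.
loose_2005 = document_label([0, 1, 2], "2005", "loose")      # average 1.0
strict_2005 = document_label([0, 1, 2], "2005", "strict")    # average 1.0
strict_2006 = document_label([4, 3], "2006-07", "strict")    # average 3.5
```

The same document can thus be relevant under the loose interpretation and nonrelevant under the strict one, as in the first two calls, where an average of 1.0 exceeds 0.5 but not 1.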
Using these definitions, classification experiments were performed with both the decision tree and the SVM method. Tables 9.2 and 9.3 show the resulting accuracy values for the two relevance interpretations, the two iTracks, and the element-based and document-based approaches. The features are considered both individually and all together.
Here ‘baseline’ denotes the case where the majority class is assigned to each instance. Accuracy values are printed bold if they are at least 1% higher than the baseline and in italics if they are at least 0.5% better. Differences significant at the 95% level are marked with ∗ and those significant at the 99% level with ∗∗.
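The majority-class baseline can be computed directly from the class distribution, as in the following sketch. The function name and the toy labels are assumptions made for the example.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by assigning the most frequent class to every instance."""
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical toy data: 6 nonrelevant and 2 relevant instances.
labels = ["nonrelevant"] * 6 + ["relevant"] * 2
baseline = majority_baseline_accuracy(labels)
```

For this toy distribution the baseline accuracy is 6/8 = 0.75. A skewed class distribution therefore yields a high baseline, which a classifier has to exceed in order to be useful.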
Overall, the classification accuracy is modest in comparison to the baseline. For the element-based approach, we get improvements only for the loose interpretation of relevance in both iTracks, using the decision tree method. The results for the individual features show that hardly anything but the number of clicks helps in predicting relevance.
For the document-based approach, we considered both averages and sums of the element-wise features reading time and overlap. Here the results for iTrack 2006-07 show no improvement at all over the baseline. In contrast, the iTrack 2005 experiments show improvements for both interpretations of relevance. For the strict interpretation, the accuracy gain is quite small and seems to originate from the number of clicks and the reading time. The highest improvements were achieved with the loose interpretation, where the overall reading time seems to be the most indicative feature.
An alternative way of looking at individual features is the computation of information gain; the corresponding results are shown in Tables 9.4 and 9.5. For the element-based classification, all information gain values are rather small (< 0.1). In the document-based view, we get somewhat higher values, especially for iTrack 2005, where the sum of reading times is the strongest indicator for both interpretations of relevance.
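For reference, information gain in its common entropy-based form can be sketched as follows for a discretised feature. This is an illustration under that assumption, not necessarily the exact computation performed by the tools used here; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(feature_values, labels):
    """Class entropy minus the expected entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder

# Hypothetical toy data with a perfectly indicative and an uninformative feature.
labels = ["relevant", "relevant", "nonrelevant", "nonrelevant"]
gain_indicative = information_gain(["long", "long", "short", "short"], labels)
gain_uninformative = information_gain(["a", "b", "a", "b"], labels)
```

A feature that splits the classes perfectly reaches the full class entropy (1 bit for two equally frequent classes), while a feature that is independent of the class has a gain of 0. Values below 0.1, as observed for the element-based classification, thus indicate features that carry very little information about relevance.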
Overall, the classification experiments have shown only small improvements over the baseline. For iTrack 2006-07, there was no accuracy gain for the document-wise view, and an improvement of only about 1.6% for the element-wise view. Presumably this poor result is due to the heterogeneous structure of the Wikipedia collection.
                          iTrack 2005           iTrack 2006-07
                          strict    loose       strict    loose
baseline                  65.96     66.51       61.43     75.94
all with svm              65.96     66.51       61.43     75.94
all with decision tree    66.34     70.52∗∗     62.13     77.58∗
clicks                    66.03     67.37       61.44     76.63
reading time              65.76     66.48       61.44     75.95
overlap                   66.03     66.48       61.44     75.95
hyperlink                 -         -           61.44     75.95
text highlighting         -         -           61.44     75.95
Table 9.2: Element-based accuracy percentage for iTrack 2005 and iTrack 2006-07 and two relevance interpretations
In iTrack 2005, the best results were achieved with the loose interpretation of relevance. The most indicative feature is the reading time, especially for the document-wise view. Comparing documents with single elements, we see that hardly any single feature seems to be indicative for the element-based view; only the combination of features leads to a noticeable improvement over the baseline. Of the two interpretations of relevance, the strict one seems to be the harder one to predict.