Several applications for the automatic analysis of textual documents may require human interaction in order to improve their effectiveness (for example e-discovery, described in the Introduction). A well-known scenario is relevance feedback in information retrieval [54, 68], where the user marks the documents returned by a retrieval system in order to enrich her information need and formulate a new and more accurate search. A modern approach to relevance feedback is discussed in [38]; it recalls our principle of maximizing a measure of utility. The authors develop an ML algorithm for the diversification of the results, in order to optimize their utility in terms of user satisfaction. On the retrieved documents, the system balances relevance (for meeting the user's needs) against informativeness (for improving the model through feedback).

Crowdsourcing applications are widely used by both the scientific community and industry (e.g., Amazon Mechanical Turk) in order to obtain labelled data and build robust datasets. The problem of combining the contributions of several annotators has been studied with the purpose of understanding the quality of the human work [3, 34, 39]. In SATC we assume that the validation is correct, setting aside issues such as inter-annotator agreement or the expertise of the annotators. Human cost is a critical aspect of crowdsourcing; in our methods we have assumed that the cost is linearly proportional to the number of documents to be validated.

In machine learning there are various scenarios in which automatic processes and human interaction are combined; they are not limited to active learning or training data cleaning. The following are some examples of such applications.

A form-filling application is presented in [41]. The described system helps users fill in forms (e.g., web forms) by supporting the operation with automatic filling

CHAPTER 3. A RANKING METHOD FOR SATC

of the empty form fields, employing constrained conditional random fields. The effectiveness of the system is measured in terms of expected user actions, so the best solution is the one that minimizes both the amount of text the user has to type and the corrections of wrong information. The software can also highlight the form fields that it has filled automatically and about which it is less confident, thus reducing the ratio between user actions and filling accuracy.

In [26], a software architecture for building training data is described, which produces a sort of automatic annotation of the unlabelled data. The goal of the application is to learn from a training set with few positive samples and many unlabelled samples, so as to reduce the human effort in annotating documents. After sampling potential negative samples, a method for selecting samples to be corrected is applied, and an AL strategy is used in order to assign the correct classes. Uncertain samples are extracted through the recomputation of a separation margin, learned with SVMs.

Weeber et al. [81] examine assessor disagreement in annotating training and test data. They evaluate different cases, and for each case they compare the classification accuracy across different annotation scenarios, e.g., different or same assessors for training and test set, two groups of assessors for the training set, etc. One interesting outcome of this study is the measurement of the human effort necessary for the annotation when a specific level of recall is requested but different assessors are used. The problem is approached by ranking documents according to the probabilistic outputs of a classifier.

Coactive learning [75] is a model of interaction between a learning system and its user. The authors start from the conjecture that user feedback yields an improvement of the prediction, but not necessarily the optimal one. The system integrates a measure of the utility of the results; the goal of the learning process is to minimize the difference between the utility of the results annotated after user feedback and the utility of the true labels. Given the assumption that human annotation may not imply a total improvement in accuracy, the algorithms learn from the annotated data in order to maximize the utility of the user feedback in successive annotations.

3.7 Conclusions

We have presented a method for ranking the documents labelled by an automatic classifier. The documents are ranked in such a way as to maximize the expected reduction in classification error brought about by a human annotator who validates a subset of the ranked list and corrects the labels when appropriate. First we have introduced a probabilistic approach to the task of semi-automated TC, then we have defined our method, based on the concept of utility in validating a ranked document. We have also proposed an evaluation measure for such ranking methods, based on the notion of expected normalized error reduction. We have introduced this measure in three steps, starting from the concept of error reduction, then defining a normalized error reduction, and finally formalizing the expected error reduction. This measure is designed with the aim of modelling the behaviour of the user who validates the ranked documents.
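The ranking idea summarized above can be illustrated with a minimal Python sketch. It is not the chapter's actual formulation: the function name `rank_by_utility`, the decision threshold, and the gain values are hypothetical, and the gains are simply passed in as numbers rather than derived from an evaluation function. The sketch only shows the general shape of a utility-based ranking, in which each document is scored by the probability that its label is wrong multiplied by the gain obtained from correcting that kind of error.

```python
def rank_by_utility(posteriors, gain_fp, gain_fn, threshold=0.5):
    """Return document indices sorted by decreasing expected gain of validation.

    posteriors: estimated P(positive | document) for each document, in [0, 1].
    gain_fp, gain_fn: assumed gains from correcting a false positive or a
        false negative (hypothetical inputs, not derived here).
    """
    def utility(p):
        if p >= threshold:
            # Predicted positive: the only possible error is a false positive.
            return (1.0 - p) * gain_fp
        # Predicted negative: the only possible error is a false negative.
        return p * gain_fn

    return sorted(range(len(posteriors)),
                  key=lambda i: utility(posteriors[i]),
                  reverse=True)


posteriors = [0.95, 0.60, 0.30, 0.10]

# Equal gains: the ranking depends only on the probability of misclassification.
by_uncertainty = rank_by_utility(posteriors, gain_fp=1.0, gain_fn=1.0)

# Unequal gains: a likely false negative can overtake a more uncertain document.
by_utility = rank_by_utility(posteriors, gain_fp=1.0, gain_fn=3.0)
```

With equal gains the document closest to the threshold comes first; once correcting a false negative is assumed to be worth more, the document with posterior 0.30 moves to the top of the list.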

We have described the details of the implementation and the experimental setting. Experiments carried out on standard datasets show that our method substantially outperforms a state-of-the-art baseline method.

It should be remarked that the very fact of using a utility function, i.e., a function in which different events are characterized by different gains, makes sense here since we have adopted an evaluation function, such as F1, in which correcting a false positive or a false negative brings about different benefits to the final effectiveness score. If we instead adopted standard accuracy (i.e., the percentage of binary classification decisions that are correct) as the evaluation measure, utility would default to the probability of misclassification, and our method would coincide with the baseline, since correcting a false positive or a false negative would bring about the same benefit. The methods we have presented are justified by the fact that, in text classification and in other classification contexts in which imbalance is the rule, F1 is the standard evaluation function, while standard accuracy is a deprecated measure because of its lack of robustness to class imbalance (see e.g., [73, Section 7.1.2] for a discussion of this point).
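The difference between the two evaluation functions can be checked on a small worked example. The Python sketch below uses an invented, imbalanced contingency table (the counts are assumptions chosen only for illustration) and computes how much F1 and accuracy improve when a single false positive or a single false negative is corrected.

```python
def f1(tp, fp, fn):
    """F1 from contingency counts; taken to be 1.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def accuracy(tp, fp, fn, tn):
    """Fraction of binary decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def gains(tp, fp, fn, tn):
    """Improvement in each measure from correcting one FP or one FN."""
    base_f1 = f1(tp, fp, fn)
    base_acc = accuracy(tp, fp, fn, tn)
    return {
        # Correcting a false positive turns it into a true negative.
        "f1_fp": f1(tp, fp - 1, fn) - base_f1,
        "acc_fp": accuracy(tp, fp - 1, fn, tn + 1) - base_acc,
        # Correcting a false negative turns it into a true positive.
        "f1_fn": f1(tp + 1, fp, fn - 1) - base_f1,
        "acc_fn": accuracy(tp + 1, fp, fn - 1, tn) - base_acc,
    }


# Illustrative counts for a rare class: 30 positives among 1000 documents.
g = gains(tp=10, fp=5, fn=20, tn=965)
```

On these counts the two accuracy gains are identical (one more correct decision out of 1000 in both cases), whereas the F1 gain from correcting a false negative is roughly three times the gain from correcting a false positive. This is why a gain-weighted ranking only differs from a ranking by probability of misclassification when the evaluation function treats the two error types differently.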

In the next chapters we will extend the method by exploring new intuitions about the ranking function and its applications. A SATC task can be investigated along several dimensions; we will follow some of the most interesting ramifications of the problem.

4 Additional Ranking Methods for Semi-Automated Text Classification

In this chapter we extend the development of utility-theoretic ranking methods for SATC. In Chapter 3 we defined a SATC method that has a “static” nature, i.e., utility gains are computed only once, at the beginning of the SATC process. We now expand our work in two directions: (a) we present a new “dynamic” ranking method, in which the gains are recomputed after each step of the manual validation, with the aim of iteratively obtaining a more accurate estimate of the improvement in accuracy (Section 4.1); (b) we switch to another way of evaluating classification accuracy, reformulating our ranking methods for micro-averaged effectiveness (Section 4.2).
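The contrast between the static and the dynamic scheme can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method of Section 4.1: the function `dynamic_validation_order`, the use of posterior probabilities to estimate expected contingency counts, the clamping of counts at zero, and the `true_labels` list standing in for the human annotator are all hypothetical choices made for this example. What the sketch does show is the defining step of a dynamic method, namely that the gains are recomputed from the updated counts before every pick instead of once at the start.

```python
def f1(tp, fp, fn):
    """F1 from (possibly fractional) counts; 1.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def dynamic_validation_order(posteriors, true_labels, budget, threshold=0.5):
    """Pick `budget` documents to validate, recomputing gains before each pick.

    posteriors: estimated P(positive | document) for each document.
    true_labels: booleans revealed only when a document is validated
        (a stand-in for the human annotator).
    """
    n = len(posteriors)
    predicted = [p >= threshold for p in posteriors]
    # Expected contingency counts estimated from the posteriors (assumption).
    tp = sum(p for p, pos in zip(posteriors, predicted) if pos)
    fp = sum(1.0 - p for p, pos in zip(posteriors, predicted) if pos)
    fn = sum(p for p, pos in zip(posteriors, predicted) if not pos)

    remaining = set(range(n))
    order = []
    for _ in range(min(budget, n)):
        # Dynamic step: gains depend on the current state of the counts.
        base = f1(tp, fp, fn)
        gain_fp = f1(tp, max(fp - 1.0, 0.0), fn) - base
        gain_fn = f1(tp + 1.0, fp, max(fn - 1.0, 0.0)) - base

        def utility(i):
            p = posteriors[i]
            return (1.0 - p) * gain_fp if predicted[i] else p * gain_fn

        best = max(sorted(remaining), key=utility)
        remaining.remove(best)
        order.append(best)

        # Replace the document's expected contribution with its validated one.
        p = posteriors[best]
        if predicted[best]:
            tp -= p
            fp -= 1.0 - p
        else:
            fn -= p
        if true_labels[best]:
            tp += 1.0  # after validation the document is a certain true positive
        tp, fp, fn = max(tp, 0.0), max(fp, 0.0), max(fn, 0.0)
    return order


order = dynamic_validation_order(
    posteriors=[0.9, 0.6, 0.4, 0.2, 0.05],
    true_labels=[True, False, True, False, False],
    budget=3,
)
```

A static variant would compute `gain_fp` and `gain_fn` once, before the loop, and sort all documents by the resulting utilities in a single pass.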

We present the two ranking methods on the basis of the already discussed concepts of utility and gain, and we show how these functions can be modified in order to meet our needs. For each method we reproduce the experimental protocol of Chapter 3, suitably modified for our purposes, and we discuss the effectiveness of the new SATC methods.
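As background for the switch to micro-averaged effectiveness mentioned above, the following Python sketch contrasts the standard macro- and micro-averaged versions of F1 over a set of per-class contingency tables. The two tables are invented for illustration; the definitions are the usual ones (macro-averaging takes the mean of the per-class F1 values, micro-averaging computes F1 on the summed counts).

```python
def f1(tp, fp, fn):
    """F1 from contingency counts; taken to be 1.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


def macro_f1(tables):
    """Mean of the per-class F1 values: every class weighs the same."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in tables) / len(tables)


def micro_f1(tables):
    """F1 on the pooled counts: every classification decision weighs the same."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return f1(tp, fp, fn)


# (tp, fp, fn) for one frequent, well-classified class and one rare, poor one.
tables = [(90, 10, 10), (1, 4, 5)]
macro = macro_f1(tables)
micro = micro_f1(tables)
```

Because the pooled counts are dominated by the frequent class, the micro-averaged score here is much higher than the macro-averaged one. The gain obtained from correcting a given error therefore depends on which of the two averages is being optimized.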