In previous sections, we described our pipeline argument mining system in detail and compared our proposed models with the baselines on different argument mining tasks. Although we reported the prediction performance of each task, those results do not reflect the true capability of the system, because each task was performed on input with true labels rather than on the output of the preceding task. In particular, both argument component classification and argumentative relation identification were fed with true argument components (AC). In this section, we test the end-to-end performance of our argument mining system.

For the essays in the test set, argument components are first extracted automatically. The extracted argument components are then classified by argumentative label (i.e., MajorClaim, Claim, Premise), and pairs of components that hold an argumentative relation are identified. To measure the end-to-end performance, we first form a union set U of the extracted argument components E and the true argument components T that were missed in the identification task.

U = E ∪ T,   E ∩ T = ∅

For the extracted argument components in E, we assign true argumentative labels to those that exactly match true argument components. The other extracted argument components receive the true label non-argumentative (i.e., they are false positives). Because the true argument components in T are not passed to the later classification tasks, creating U ensures that the missing argument components in T, and subsequently the argumentative relations among them, are taken into account when measuring performance. Thus, our performance measures for argument component classification and argumentative relation identification embed the performance of argument component identification.
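As a minimal sketch of how the union set U and its true labels could be assembled, assuming each component is represented by hypothetical (start, end) token offsets and that the label name "Non" stands for non-argumentative (the function name and the toy spans are illustrative, not taken from the system described here):

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: a component is a (start, end) token span.
Span = Tuple[int, int]


def build_union(extracted: Set[Span], gold: Dict[Span, str]):
    """Return (true labels over U, E, T) under exact matching."""
    # T: true components that were not extracted with exactly the same span.
    missed = set(gold) - extracted
    # Extracted spans take the gold label on an exact match;
    # otherwise they are false positives with true label "Non".
    true_labels = {span: gold.get(span, "Non") for span in extracted}
    # Missed true components join U with their gold labels.
    true_labels.update({span: gold[span] for span in missed})
    return true_labels, extracted, missed


# Toy data, invented for illustration only.
gold = {(0, 5): "MajorClaim", (6, 14): "Claim", (15, 30): "Premise"}
extracted = {(0, 5), (6, 12), (15, 30)}

labels, E, T = build_union(extracted, gold)
```

In this toy case the span (6, 12) misses the gold boundary (6, 14), so it enters U as a false positive while the gold Claim enters U through T.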

The test set has 1266 true argument components (AC). Our argument component identification (ACI) model returned 3460 textual spans (i.e., sub-sentence portions), of which 1272 were identified as AC. Of the extracted AC, 941 exactly match true AC (i.e., true positives). The confusion matrix is given in Table 28. Our union set U contains 1597 AC: the 1272 returned by our model (set E) and the 325 true AC that were misidentified as non-argumentative (set T). We note that approximate match, in which two text spans are considered a match if their overlap is greater than some threshold (Persing and Ng, 2016), should be more favorable for the boundary extraction problem. We nevertheless use exact match in this study to give a sense of the difficulty of argument mining. Approximate match may make more sense when we know how much flexibility an end application allows in argument mining output.

| | True argumentative | True non-argumentative |
|---|---|---|
| Predicted argumentative | 941 | 331 |
| Predicted non-argumentative | 325 | 1868 |

Table 28: Confusion matrix of argument component identification on the test set. Corpus: Persuasive2.

| | True MajorClaim | True Claim | True Premise | True Non |
|---|---|---|---|---|
| Predicted MajorClaim | 81 | 10 | 1 | 64 |
| Predicted Claim | 15 | 138 | 50 | 95 |
| Predicted Premise | 0 | 77 | 569 | 172 |
| Predicted Non | 57 | 79 | 189 | 0 |

Table 29: Confusion matrix of argument component classification on the test set. Corpus: Persuasive2.
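The difference between exact and approximate matching can be illustrated with a short sketch. The overlap measure used here (shared tokens divided by the length of the longer span) and the 0.5 threshold are assumptions chosen for illustration; they are not necessarily the definition used by Persing and Ng (2016).

```python
def exact_match(a, b):
    """Spans match only if both boundaries coincide."""
    return a == b


def approx_match(a, b, threshold=0.5):
    """Spans match if their overlap ratio exceeds the threshold.

    Spans are half-open (start, end) token offsets. The ratio is the
    overlap divided by the longer span's length (an assumed measure).
    """
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    longer = max(a[1] - a[0], b[1] - b[0])
    return longer > 0 and overlap / longer > threshold


predicted, gold = (6, 12), (6, 14)
is_exact = exact_match(predicted, gold)
is_approx = approx_match(predicted, gold)
```

A prediction that cuts two tokens off the end of a gold span fails the exact test but passes the approximate one, which is why exact matching gives the more conservative picture.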

8.5.1 Argument Component Classification

Given the set U, Table 29 presents the confusion matrix of argument component classification (ACC). The row Predicted Non does not reflect misclassification by our ACC model, but shows errors carried over from the ACI results. Our ACC model achieves an end-to-end F1 of 0.421, with F1:MajorClaim = 0.524, F1:Claim = 0.458, and F1:Premise = 0.699. Stab and Gurevych (2017) did not report the end-to-end performance of their models, so we do not have a baseline for direct comparison. To give more intuition about the task difficulty, we present the end-to-end measures reported by Persing and Ng (2016). The authors developed a heuristic for argument component candidate extraction and an ILP framework for joint prediction. They conducted 5-fold cross-validation on the corpus Persuasive1. Essays in that corpus are of the same kind as those in Persuasive2, which we use in this study (see Chapter 3). Their best system with exact matching achieved F1:MajorClaim = 0.169, F1:Claim = 0.374, and F1:Premise = 0.534.

| | True Linked | True Not-linked |
|---|---|---|
| Predicted Linked | 252 | 369 |
| Predicted Not-linked | 449 | 3978 |

Table 30: Confusion matrix of argumentative relation identification on the test set. Corpus: Persuasive2.
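The per-class F1 scores reported for ACC can be recomputed from the counts in Table 29, as the following sketch shows. Averaging over all four labels, including Non, is an assumption about how the overall score was obtained; it yields roughly 0.42, close to the reported 0.421.

```python
LABELS = ["MajorClaim", "Claim", "Premise", "Non"]

# Counts from Table 29: rows are predicted labels, columns are true labels.
TABLE_29 = [
    [81, 10, 1, 64],
    [15, 138, 50, 95],
    [0, 77, 569, 172],
    [57, 79, 189, 0],
]


def per_class_f1(matrix):
    """F1 per label: 2*TP / (number predicted + number true)."""
    scores = {}
    for i, label in enumerate(LABELS):
        tp = matrix[i][i]
        predicted = sum(matrix[i])               # row total
        actual = sum(row[i] for row in matrix)   # column total
        denom = predicted + actual
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


f1 = per_class_f1(TABLE_29)
# Assumed aggregation: unweighted mean over the four labels.
macro_f1 = sum(f1.values()) / len(f1)
```

The Non label has no true positives by construction, so it contributes an F1 of zero to the average.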

8.5.2 Argumentative Relation Identification

From the 1272 argument components returned by our ACI model, our argumentative relation identification (ARI) model formed 4854 ordered pairs of AC, of which 621 were predicted as Linked. Because 325 true AC were missed by our ACI model, 189 Linked pairs of AC were never given as input to the ARI model.

To obtain an end-to-end F1 for Linked pairs, we add the 189 true Linked pairs to the cell [Predicted Not-linked, True Linked] of the confusion matrix. Thus, the confusion matrix in Table 30 has 189 more instances than the total number of pairs formed by our ARI model. With this adjustment, our ARI model obtained F1:Linked = 0.381. Persing and Ng (2016) achieved F1 = 0.136 on corpus Persuasive1, although their task was more difficult because it distinguished Support, Attack, and No-relation.
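The effect of this adjustment on F1:Linked can be checked with a small calculation. The sketch below reads the counts from Table 30, treating its Predicted Not-linked / True Linked cell as already containing the 189 added pairs, as the text states; the variable names are illustrative.

```python
def f1_score(tp, fp, fn):
    """F1 expressed directly in terms of confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


TP, FP = 252, 369        # Predicted Linked row of Table 30
FN_TOTAL = 449           # includes the pairs never seen by the ARI model
MISSED_LINKED = 189      # Linked pairs lost because of ACI misses

# Score over only the pairs the ARI model actually received.
f1_seen_pairs = f1_score(TP, FP, FN_TOTAL - MISSED_LINKED)
# End-to-end score after counting the missed pairs as false negatives.
f1_end_to_end = f1_score(TP, FP, FN_TOTAL)
```

Counting the missed pairs as false negatives lowers recall and leaves precision unchanged, so the end-to-end score is necessarily lower than the score over the pairs the model saw.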

We roughly compare the end-to-end F1 scores with the results of the individual tasks in previous sections. We observe a great reduction in performance in our end-to-end setting. For example, F1:Linked has decreased by 30%, while F1 of ACC has been reduced by nearly 50%. Although argument component identification obtains high performance (about 1.5% lower than the human upper bound), the performance degradation in the end results is remarkable, which shows the essential value of a good ACI model.

8.6 SUMMARY

This section presented the end-to-end performance of our pipeline argument mining system on the corpus Persuasive2. The reported performance is promising but shows a need for improvement. Our plan for enhancing our argument mining system includes improving the ACI model and implementing joint prediction. We also suggest using approximate match for ACI to increase model coverage when applying argument mining to a real task.

9.0 AUTOMATED ESSAY SCORING: AN EXTRINSIC EVALUATION OF