7.1 Appraisal

The evaluation of natural language processing systems presents a range of problems and has become a field of research in its own right (Galliers and Sparck Jones, 1993). Evaluation of the Rhetorica system is confounded by several further problems. In the first instance, it is not currently applied in any specific domain: Galliers and Sparck Jones make repeated use of case studies - often in information retrieval - to examine evaluation methods. In such situations, there are a number of metrics which can be employed in judging the efficacy of a system. Without an application, evaluative study becomes that much more difficult. Secondly, as Galliers and Sparck Jones point out (pp. 98-99), evaluation for NLG systems is particularly difficult - the only possible route they mention is “to evaluate NLG systems in the context of their application”, leaving more generic theories of text organisation difficult to evaluate.

Rhetorica suffers further from the fact that it does not generate the surface text, but an annotated deep structure which then requires realisation by subsequent processing not addressed in this work. Evaluating Rhetorica on the basis of output text thus becomes more difficult and less reliable.

There are a number of questions which can be asked in trying to determine the level of Rhetorica's performance. The two key issues which have run throughout the work, motivating design decisions at a variety of levels, are the notions of coherency and persuasive effect, and these can be employed in examining system output. Does Rhetorica produce text which is coherent? Does Rhetorica produce only text which is coherent? The first question is the easier to establish: in the two examples discussed (the first in §3.3, the second - the vegetarianism argument - in §4.4 and §5.5), the result of the system is extremely close to the structure of the original natural language; since these are coherent, Rhetorica has demonstrated that it is at least capable of generating coherent structure. But does Rhetorica ensure that the structure it generates is coherent? There are two problems which beset attempts to answer this question: (i) as discussed in §3.1.4, there is no definition for coherency, so it is impossible to reason a priori about the coherency of structures Rhetorica may produce - instead, a highly labour intensive (and non-automatable) task of generate-and-test is called for; and (ii) coherency is not dichotomous, but scalar (also discussed in §3.1.4), so results can at best be graded, and at worst - and in practice - only be compared with other, natural, coherent versions with the same propositional content. Because such validation is so expensive, only limited work has been conducted in this direction. However, these limited results are promising: both with the examples discussed above, and in several other small arguments taken from the corpus, Rhetorica creates structures which are not only coherent, but also highly similar to those of the natural arguments. Finally, the standard which Rhetorica must achieve in order to create only coherent structure is actually rather low: as Cohen (1987) points out, introduction of appropriate clue words can repair significant structural impairment (indeed, Cohen tentatively suggests that any incoherency in structure can be rectified by clue word repair). Structural coherency, then, does not in itself represent a suitable vehicle for system evaluation.

Since Rhetorica focuses on building arguments to persuade, the persuasive effect of a text might be employed as an evaluative metric. But clearly, this approach too suffers from major problems. Quite apart from problems equivalent to those which beset coherency (particularly that it is the structure not the form which must be evaluated), there are even greater difficulties in assessing persuasive effect. As explained in §2.2.1, persuasive effect as construed by both Perelman (Perelman and Olbrechts-Tyteca, 1969) and Freeman (1991) is contingent upon a specific audience; any evaluation must therefore be with respect to some audience. Even if an appropriately detailed characterisation of an audience were available, it still remains doubtful whether an objective assessment of persuasive effect (on that audience) would be possible. One option might be to consider whether or not the argument is successful in altering the beliefs of its audience (assuming that is its goal); this is too crude an estimate, however, since (i) it is possible for a highly persuasive argument to fail to effect an absolute belief change, and (ii) it does not admit of intermediate levels of persuasion, and it seems inappropriate to consider persuasive effect as simply present (successful belief change) or absent (no belief change). As with coherency, it is possible to peek at the persuasive capabilities of Rhetorica by comparing system output with arguments known to have been persuasive, but, as with coherency, the limited scope and arbitrariness of such validation limits the utility of the approach (though, as with coherency, in the few examples examined, Rhetorica performs extremely well).

One final avenue open for investigation is a more empirical, more objective assessment of the overall effectiveness of an argument: direct experimentation. By studying the response of a large sample to a variety of textual arguments, some generated by Rhetorica, some extracted from a natural environment, it should be possible to assess the competence of the Rhetorica system. One of the key problems with this attractive approach is the complexity of conducting experimental studies dealing with such abstract phenomena as ‘persuasiveness’ (witness, for example, the conflicting results surveyed in (McGuire, 1969)). Constructing an assay to evaluate the functionality of Rhetorica whilst eliminating as many confounding features as possible - and then conducting a large scale experiment with concomitant statistical analysis - is not only ambitious but also premature. In the first place, it is unclear what would constitute a good (exhaustive, weighty) list of external parameters to be controlled for. Secondly, there is a broader issue that NLG systems have not traditionally been evaluated by human centred experimentation, and so the utility and applicability of the approach is far from certain.

VII. PERORATION

For these reasons, the evaluation of the Rhetorica system carried out in the current work centres upon a preliminary pilot study, which, in addition to examining the efficacy of the system, also aims to uncover related phenomena and to explore the potential for experimental psychology research within the domain of NLG evaluation, with a view to motivating, justifying and delimiting a full scale investigation to be executed within the programme of future work.

The Pilot Rhetorica Evaluation Study (PRES) follows a simple design rubric: an argument taken from the corpus is analysed to elicit its internal structure and to infer the original beliefs of the speaker (with regard both to the arena of discussion, and to the beliefs and parametric characterisation of the audience). These beliefs and parameters then form the input to Rhetorica, which constructs the deep structure of an argument appropriate to that input. The final stage is to construct, by hand, the surface text of that deep structure. It is during this process that bias could unintentionally be introduced by the experimenter, so to minimise this possibility, realisation was restricted almost entirely to the original wording, except where this was prohibited by syntactic constraints. Sentence boundaries and other punctuation devices were also maintained, except where clue realisation could not be performed in their presence. The aim of this stage is thus to have available two arguments on the same topic: the original and one generated by Rhetorica, which may have (i) potentially different orderings of premises and conclusions; (ii) potentially different enthymematic contraction; (iii) potentially different use of clue phrases; (iv) potentially different punctuation breaks; (v) potentially differing content (in that the Rhetorica-generated argument may eliminate subarguments due to breadth or depth constraints).

In addition to these two versions of the argument, a third is also generated, again using the Rhetorica system, but employing only the planning mechanism in conjunction with argument integrity constraints - i.e. with all EG level heuristics inoperative. The textual realisation process is then the same as above. Finally, the entire process is repeated on two further arguments from the corpus. The three arguments exhibit a number of important differences: they have different levels of emotiveness (argument one, for example, concerns road signs; argument two, the rights of birth parents in cases of adoption); they are of different length (both textually and in terms of the number of functional units); they are of different complexity (in terms of depth, breadth and range of subarguments); and they involve premises of different sizes. All arguments, however, are assumed to be aimed at a similar, broad audience (specifically, the arguments were all taken from the ‘Letters’ page of The Guardian appearing during one week in October 1996) with a high level of technical competence and low level of scepticism. The arguments, their analyses and their Rhetorica generation, are those presented above in chapter six; the argument variants generated with EG functionality disabled are given in Appendix C.

Thus the subject is presented with three variants of each of three arguments: the original from The Guardian (this is termed ‘Orig’ in the analysis), the argument generated by the full Rhetorica system (abbreviated ‘FullRhet’), and finally, the argument generated by Rhetorica without EG heuristic functionality (termed ‘NoFrills’). The subject is then provided with the following instructions: “... you are asked to rate the texts on the basis of how persuasive you find them. Following each text is a box: please enter a number between 0 and 9 to indicate whether you found the argument highly persuasive (9) or totally hopeless (0).” No further information is provided on how to perform the assessment. The subjects are, unusually, quite diverse: in total, the 34 participants included computer science academics, CS students, non-CS academics, healthcare professionals, and university clerical staff. This wide range is important and appropriate - to take, for example, just a single cohort of students is not only unrealistic, but also quite different from the audience for whom the argument was originally intended (namely, the diverse readership of The Guardian).

In addition to factors such as emotiveness and complexity, which are varied across the arguments, it is also likely that primacy and recency effects would play a role in subjects’ assessment of the arguments (though, on the basis of (Hovland and Mandell, 1957), it is not certain what that effect would be). For this reason, the order of presentation of the arguments was varied: for the first set of arguments, that generated by Rhetorica (FullRhet) preceded the original (Orig), which in turn preceded the minimal Rhetorica argument (NoFrills); for the second, the order was Orig, NoFrills, FullRhet, and for the third, NoFrills, Orig, FullRhet. In this way, the primacy effect, recency effect, and argument ordering effect were all randomised to the fullest extent given only three replicates.
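The presentation orders just described can be tabulated to see which positions each argument type occupied across the three sets. The following is only a minimal sketch in Python; the function name `positions_by_type` and the set labels are illustrative and not part of the original study materials.

```python
# Presentation orders as stated in the text (position 1 is shown first).
# The dictionary keys "set 1" .. "set 3" are illustrative labels.
ORDERS = {
    "set 1": ["FullRhet", "Orig", "NoFrills"],
    "set 2": ["Orig", "NoFrills", "FullRhet"],
    "set 3": ["NoFrills", "Orig", "FullRhet"],
}


def positions_by_type(orders):
    """Map each argument type to the sorted list of positions it occupied."""
    result = {}
    for order in orders.values():
        for position, arg_type in enumerate(order, start=1):
            result.setdefault(arg_type, []).append(position)
    return {arg_type: sorted(ps) for arg_type, ps in result.items()}


if __name__ == "__main__":
    for arg_type, positions in positions_by_type(ORDERS).items():
        print(arg_type, positions)
```

Such a tabulation makes it easy to see at a glance how evenly each argument type was spread over first, second and third position.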

The raw results are given in Appendix C, and are summarised below in Figure 7.1. The results have been presented in two forms: firstly, the raw figures provided by the subjects, and secondly, the rankings based on those figures. Although the translation from the former to the latter involves a loss of information, it eliminates the skew introduced by individual subjects who employ an unusually high or unusually low range of marks.
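The translation from raw marks to rankings can be sketched as follows. This is an illustrative Python fragment, not the procedure actually used in PRES: the function name `rank_set` is hypothetical, and the treatment of ties (tied ratings receive no rank) is an assumption.

```python
def rank_set(scores):
    """Return {argument_type: rank} for one subject's marks on one set.

    Rank 1 is the highest mark. A type whose mark is tied with another
    type receives no rank (assumption: tied ratings are left out).
    """
    values = list(scores.values())
    ranked = {}
    for arg_type, score in scores.items():
        if values.count(score) > 1:
            continue  # tied rating: not ranked
        # Rank = 1 + number of strictly higher marks in the same set.
        ranked[arg_type] = 1 + sum(1 for v in values if v > score)
    return ranked


if __name__ == "__main__":
    # Two hypothetical subjects: one marks harshly, one generously.
    harsh = {"FullRhet": 3, "Orig": 1, "NoFrills": 2}
    generous = {"FullRhet": 9, "Orig": 7, "NoFrills": 8}
    print(rank_set(harsh) == rank_set(generous))
```

The two hypothetical subjects use quite different ranges of marks, yet yield identical rankings, which is the sense in which ranking removes the skew from individual marking habits.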

              FullRhet    Orig        NoFrills    Overall
Total         649         562         605         1816
Mean          6.36        5.51        5.93        5.93
StDev         1.94        1.84        1.75        1.87
#times 1st    52% (53)    21% (21)    19% (19)
#times 2nd    19% (19)    27% (28)    37% (38)
#times 3rd    20% (20)    38% (39)    28% (29)

Figure 7.1 Summary of results from PRES as raw data and rankings
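The summary figures can be cross-checked with a little arithmetic. The sketch below assumes that each of the 34 subjects rated every variant of all three arguments, giving 102 ratings per argument type; that denominator is an inference from the text rather than something stated explicitly, and the helper names are illustrative.

```python
SUBJECTS = 34
SETS = 3
# Assumption: every subject rated every variant in every set.
RATINGS_PER_TYPE = SUBJECTS * SETS  # 102

TOTALS = {"FullRhet": 649, "Orig": 562, "NoFrills": 605}
FIRST_PLACE_COUNTS = {"FullRhet": 53, "Orig": 21, "NoFrills": 19}


def mean_scores(totals, n):
    """Mean mark per argument type, rounded to two decimal places."""
    return {k: round(v / n, 2) for k, v in totals.items()}


def percentages(counts, n):
    """Counts expressed as whole-number percentages of n ratings."""
    return {k: round(100 * v / n) for k, v in counts.items()}


if __name__ == "__main__":
    print(mean_scores(TOTALS, RATINGS_PER_TYPE))
    print(percentages(FIRST_PLACE_COUNTS, RATINGS_PER_TYPE))
    # Overall mean across all three argument types.
    print(round(sum(TOTALS.values()) / (3 * RATINGS_PER_TYPE), 2))
```

Under that assumption the computed means and first-place percentages agree with the tabulated values.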

Though the sample size was small (a total of 34 subjects) - as is appropriate to a pilot study - the results are encouraging not only for Rhetorica, but also for the approach in evaluative technique in general, and in terms of identifying factors which need to be taken into account in a full study.

The raw data has several pertinent features. The first is the high standard, and the relatively small difference in the mean scores of the argument types. These figures are summarised in the graph in Figure 7.2, from which it can be seen that even if the variance were smaller, the difference between averages of the argument types is unlikely to be significant. Nevertheless, it is clearly demonstrated that the text produced by Rhetorica - with or without EG level processing - is at least as good as the original arguments. The results suggest that the maintenance of subargument integrity is thus sufficient for acceptable textual argument.

[Figure: raw assessment averages (Set 1, Set 2, Set 3 and total averages) by argument type (FullRhet, Orig, NoFrills)]

Figure 7.2 Summary of raw results from PRES

The high variance and consequent non-significant differences between argument types is to be expected, however, because subjects were given minimal guidance on how to evaluate arguments, and were at liberty to interpret the notion of ‘persuasiveness’ in any way they saw fit. A more insightful means of comparison between the argument types is by consideration of the order in which subjects ranked arguments in a given set. Figure 7.3 illustrates the aggregate rank profile as a percentage, i.e. the frequency with which a particular argument type was rated as the best, middle, or worst in its set.

[Figure 7.3: Rankings of argument types - percentage of cases in which FullRhet, Orig and NoFrills arguments were ranked best, middle or worst in their set]

These ranked results show several important features. The first, and most striking, is the high frequency with which the arguments generated by the full Rhetorica processing are rated the best in their set: these arguments were selected by subjects as the best in 52% of cases, as compared with subjects choosing the original as best in 21% of cases, and the EG-deactivated arguments as best in 19%*. A similar, but less striking relationship is found amongst those arguments rated as ‘middle’ - the arguments generated by the version of Rhetorica with EG level processing deactivated were evaluated ‘middle’ in 37% of cases, as compared with 27% for the original arguments and 19% for those generated by the full Rhetorica. Finally, original arguments were more commonly rated as the worst in a set: 38% (compared with 20% for the full Rhetorica and 28% for the restricted system).

It seems from these figures that those arguments generated by the full Rhetorica system performed significantly better than their natural, original counterparts at appearing persuasive. Furthermore, it is clearly demonstrated that the employment of the various forms of heuristic processing at the EG level does significantly contribute to the persuasiveness of a text - since the full Rhetorica system far outperforms the version without such processing. These are particularly encouraging results, as they suggest that the rich rhetorical and psychological techniques exploited by Rhetorica enhance a text beyond what may be achieved by a naïve human author.

This is not to say that PRES has definitively proved the capabilities of the Rhetorica system: as a pilot study, an analysis of the shortcomings and future extensions of PRES is almost as important as the results gained in the first instance. It is to this analysis that the next section is in part addressed.

7.2 Future work

Before examining the limitations and future extensions to the Rhetorica system itself, the discussion focuses upon the shortcomings of PRES, and the measures which could be taken to address those shortcomings in a larger, more rigorous study.

The first set of problems is to identify factors which may affect the outcome so that those factors may be controlled for. The first such factor to consider is the effect of primacy and recency. The PRES attempts to minimise these effects by randomising the order across argument sets; however, with only three replicates, the effects could potentially impact the results. In analysing the data, however, it seems that primacy and recency play little or no role, as evinced by the summary in Figure 7.4: if such effects were impacting the results in PRES, a correlation would be expected between position and rating/ranking (positive if the dominant effect was recency, negative if primacy). Such a correlation is clearly not present to any significant degree (though cursory inspection might suggest a very slight recency effect). Given the identification of the influence of primacy and recency on experimental results in previous studies, however, it would still be desirable to eliminate their effects entirely. This could be achieved quite simply in a larger study by offering only one argument from each group to any given subject (necessitating a larger sample size). It is assumed to be unlikely that primacy or recency would affect the reception of arguments in differing domains (e.g. the presentation of a ‘pro’ argument concerning road signs is unlikely to affect the reception of a ‘con’ argument concerning abortion).
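The reasoning about the sign of the correlation can be made concrete with a short sketch. The Python below computes a Pearson coefficient between presentation position and mark; the function names, the `threshold` parameter and the example values are all hypothetical illustrations, and none of the numbers are PRES data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation undefined for constant data")
    return cov / (spread_x * spread_y)


def dominant_effect(r, threshold=0.1):
    """Interpret the sign of r; the threshold is an arbitrary assumption."""
    if r >= threshold:
        return "recency"   # later positions tend to be marked higher
    if r <= -threshold:
        return "primacy"   # earlier positions tend to be marked higher
    return "none"


if __name__ == "__main__":
    positions = [1, 2, 3]      # presentation position
    made_up_marks = [5, 6, 7]  # invented values, not PRES data
    print(dominant_effect(pearson(positions, made_up_marks)))
```

A full study would, of course, need a proper significance test rather than a fixed threshold; the sketch only shows how the direction of the correlation maps onto the two effects.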

* The small discrepancy - these percentages should sum to 100% - is due to the occurrence of equally rated arguments: such ratings were ignored in the counts.


[Figure: effect of presentation position (first, second, third) on judgements of persuasive strength]

Figure 7.4 Summary of results on primacy/recency effects in PRES

Another, practical, problem with the PRES comes as a result of the small corpus: although Rhetorica includes a rich, wide range of interacting argument forms and heuristics, only a limited subset of these were actually identified in the portion of the corpus employed for the PRES. A more rigorous study would benefit from being preceded by a wider corpus study, from which it would then be possible to extract examples of a more diverse range of argument forms and heuristics (e.g. Modus Tollens, Inductive Generalisation, and their associated heuristics). The section of the corpus employed in the PRES may also be susceptible to slight criticisms of artificiality, since pieces on the ‘Letters’ page are subject to editing before appearing in print. This editing process may improve or potentially damage (through over-zealous enthymematic contraction) an author’s original argument. A wider corpus study would eliminate this problem (though in defence of the original approach, the arguments which appear in print are still persuasive monologues aiming to alter beliefs or behaviour of a particular audience, and are therefore encompassed within the remit of the Rhetorica system).

Relatedly, a full scale study would also need to address audiences of different types - i.e. audiences about whom an argument’s author maintains different assumptions. This would not only test the hearer-sensitive aspects of Rhetorica, but also widen the generality of the investigation and, consequently, its results.

The single greatest problem with the PRES, however, concerns its means of eliciting subjects’ judgements. Simply asking explicitly for an estimate of the persuasive effect of a text carries with it a