I used an existing implementation of the SARI metric (Xu et al., 2016)² to evaluate the sentence simplification systems described in this thesis. Xu et al. (2016) note that SARI “principally compares [s]ystem output [a]gainst [r]eferences and against the [i]nput sentence.” It is based on a comparison of each sentence generated by a simplification system in response to a given input sentence with both the original form of the input sentence and with the set of sentences generated by human simplification of the input sentence. This metric is preferred to BLEU for the evaluation of sentence simplification systems because it is noted to correspond better with human judgements of simplification quality (Xu et al., 2016). SARI provides a measure of the similarity between a single sentence and its simplification. I adapted the implementation to compute an average score over all simplified sentences output by the systems (for each type of sentence and each text register).
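The averaging step described above can be sketched as follows. This is a minimal illustration, not the adapted implementation itself: the record keys (`sentence_type`, `register`, `source`, `output`, `references`) and the `score_fn` argument order are assumptions made for this example, and `score_fn` stands in for whatever sentence-level SARI function is used.

```python
from collections import defaultdict
from statistics import mean


def average_by_group(records, score_fn):
    """Average a sentence-level score per (sentence type, register) pair.

    `records` is an iterable of dicts with the hypothetical keys used below;
    `score_fn(source, output, references)` is a placeholder for a
    sentence-level metric such as SARI.
    """
    grouped = defaultdict(list)
    for rec in records:
        key = (rec["sentence_type"], rec["register"])
        score = score_fn(rec["source"], rec["output"], rec["references"])
        grouped[key].append(score)
    # One mean score per group, mirroring one cell of a results table.
    return {key: mean(scores) for key, scores in grouped.items()}
```

Passing the metric in as a function keeps the grouping logic independent of any particular SARI implementation.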
In addition to the SARI evaluation metric, I calculated the F1-score of the
method as the harmonic mean of precision and recall, given by Algorithm 2. In this algorithm,
$$\mathrm{sim}(h, r) = 1 - \frac{\mathrm{ld}(h, r)}{\max(\mathrm{length}(h), \mathrm{length}(r))}$$

where h and r are sentences occurring in the gold standard and in the system response, respectively; ld is the Levenshtein distance between h and r (Levenshtein, 1966);³ and length(x) is the length of x in characters. The intuition behind Algorithm 2 is to find, in a greedy manner, the best matches between sentences produced by the system and sentences in the gold standard while still allowing some small differences between them.

² Available at https://github.com/cocoxu/simplification/blob/master/SARI.py. Last
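As a worked illustration of the similarity measure (the strings and lengths here are invented for the example and do not come from the evaluation data), consider the words "kitten" and "sitting", whose Levenshtein distance is 3, and then a hypothetical sentence pair whose longer member has 40 characters:

```latex
\[
\mathrm{sim}(\text{kitten}, \text{sitting})
  = 1 - \frac{\mathrm{ld}(\text{kitten}, \text{sitting})}{\max(6, 7)}
  = 1 - \frac{3}{7} \approx 0.571
\]
\[
1 - \frac{1}{40} = 0.975, \qquad 1 - \frac{2}{40} = 0.95
\]
```

For the 40-character pair, a single character edit leaves the similarity above the 0.95 threshold used in Algorithm 2, whereas two edits give exactly 0.95, which does not count as a match because the comparison in the algorithm is strict.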
Input:  H – set of simplified sentences in the gold standard for a given input sentence S
        R – set of simplified sentences produced by the system for input sentence S
        H′ ← H
        R′ ← R
Output: Precision, Recall

matched_pairs = 0
while |H| ≠ 0 and |R| ≠ 0 do
    h, r ← arg max_{h ∈ H, r ∈ R} sim(h, r)
    if sim(h, r) > 0.95 then
        H = H \ {h}
        R = R \ {r}
        matched_pairs += 1
    else
        break
    end
end
Precision = matched_pairs / |H′|
Recall = matched_pairs / |R′|
Algorithm 2: Evaluation algorithm for sentence simplification
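Algorithm 2 can be sketched in Python as below. This is an illustrative reading of the printed algorithm, not the code used in the experiments (the footnote states that a Perl implementation of Levenshtein distance was used). The function names, the use of lists rather than sets (so that duplicate sentences are preserved), and the handling of empty inputs are assumptions of this sketch. The denominators follow the algorithm as printed, with |H′| for precision and |R′| for recall; the more common convention would swap them, but the F1-score is the same either way because the harmonic mean is symmetric.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def sim(h: str, r: str) -> float:
    """Similarity as defined in the text: 1 - ld / length of longer string."""
    longest = max(len(h), len(r))
    # Assumption: two empty strings are treated as identical.
    return 1.0 if longest == 0 else 1.0 - levenshtein(h, r) / longest


def greedy_match(gold, system, threshold=0.95):
    """Return (precision, recall, f1) by greedy best-pair matching."""
    H, R = list(gold), list(system)
    n_gold, n_sys = len(H), len(R)  # |H'| and |R'| in the algorithm
    matched_pairs = 0
    while H and R:
        # Pick the most similar remaining (gold, system) pair.
        h, r = max(product(H, R), key=lambda pair: sim(*pair))
        if sim(h, r) <= threshold:
            break
        H.remove(h)
        R.remove(r)
        matched_pairs += 1
    # Denominators as printed in the algorithm; zero sizes give 0.0 (assumption).
    precision = matched_pairs / n_gold if n_gold else 0.0
    recall = matched_pairs / n_sys if n_sys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With two gold sentences and three system sentences, two of which reproduce the gold sentences exactly, the sketch yields two matched pairs and an F1-score of 0.8.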
Table 6.3 displays evaluation statistics for methods to simplify Type 1 sentences obtained using the SARI and F1 metrics. These include the simplification methods presented in Chapter 5 of this thesis.

The Bsln subcolumn displays the performance results of a baseline system exploiting the transformation schemes and handcrafted rule activation patterns presented in Section 5.2, but with each sign tagged using the majority class label observed for that sign in our annotated data. In this setting, with the exceptions of those listed in Table 6.4, all signs were tagged with class label SSEV (left boundaries of subordinate clauses). Comparison of these results with those in the OB1 column indicates the contribution made by the automatic sign tagger to the simplification task.

The MUSST column presents evaluation results for a reduced version of the MUSST sentence simplification system (described in Section 5.1.1.1, page 119).⁴ MUSST implements several types of syntactic simplification rule. In the table, I focus on the performance of the rule which splits sentences containing conjoint (compound) clauses, which is used to simplify Type 1 sentences. I deactivated the other transformation functions (simplifying relative clauses, appositive phrases and passive sentences).

STARS is a method for automatic simplification of Type 1 sentences which implements the sentence transformation schemes specified in Section 5.2.1 of this thesis. To identify the spans of compound clauses in input sentences and to implement the rule activation patterns used in the sentence transformation schemes, STARS uses the sequence tagging approach described in Section 4.3. Thus, STARS exploits machine-learned rule activation patterns. It is a fully automatic system, exploiting machine learning methods for sign tagging (Chapter 3) and for identification of the spans of compound clauses. The STARS column in Table 6.3 presents evaluation figures for this sentence simplification method.

OB1 is also an implementation of the sentence simplification method presented in Chapter 5, which exploits the handcrafted rule activation patterns described in Section 5.2.2 of that chapter. In Table 6.3, the OB1 column displays the performance of this system when operating in fully automatic mode, exploiting the sign tagger described in Chapter 3. The Orcl column displays the performance of the OB1 sentence simplification method when it exploits error-free sign tagging (an oracle).

³ I used the Perl implementation of Levenshtein distance posted at https://www.perlmonks.

⁴ Available at https://github.com/carolscarton/simpatico_ss. Last accessed 7th January 2020. Experiments conducted in my evaluations were based on a version downloaded and modified in January 2018. I am not aware of any subsequent change made to the system since then.
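The majority-class tagging used by the Bsln baseline can be illustrated with a short sketch. The function names and the toy annotations are invented for this example, and falling back to SSEV for signs absent from the annotated data is an assumption of the sketch rather than something stated in the text.

```python
from collections import Counter, defaultdict


def majority_tags(annotations):
    """Map each sign to the class label it carries most often.

    `annotations` is an iterable of (sign, tag) pairs from annotated data.
    """
    counts = defaultdict(Counter)
    for sign, tag in annotations:
        counts[sign][tag] += 1
    return {sign: c.most_common(1)[0][0] for sign, c in counts.items()}


def baseline_tag(signs, majority, default="SSEV"):
    """Tag each sign with its majority label.

    Assumption: signs never seen in the annotations receive `default`.
    """
    return [majority.get(sign, default) for sign in signs]


# Toy annotations, invented for illustration only.
toy_annotations = [
    (", and", "CEV"), (", and", "CEV"), (", and", "CMV1"),
    ("and", "CMV1"), ("that", "SSEV"),
]
toy_majority = majority_tags(toy_annotations)
```

Because such a baseline ignores the context of each sign, comparing it with OB1 isolates the effect of context-sensitive sign tagging.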
Table 6.3: System performance when simplifying Type 1 sentences

Metric    Register    Bsln   MUSST  STARS  OB1    Orcl
SARI      Health      0.201  0.124  0.309  0.362  0.514
SARI      Literature  0.203  0.087  0.190  0.202  0.229
SARI      News        0.119  0.171  0.478  0.596  0.623
F1-score  Health      0.362  0.281  0.532  0.495  0.613
F1-score  Literature  0.150  0.101  0.286  0.208  0.262
F1-score  News        0.233  0.237  0.623  0.690  0.706
Table 6.4: Tags most frequently assigned to the signs in our annotated corpus

Majority Tag  Signs
CEV           [; or], [: but], [: and], [; but], [; and], [, but], [, and]
CLN           [or]
CMN1          [, or]
CMV1          [and]
ESEV          [,]
SPECIAL       [: that]
SSCM          [:]
According to the F1 metric, when transforming Type 1 sentences in the registers of health and literature, the output of OB1 is more similar to the gold standard than the output of the baseline (Bsln) is. For both evaluation metrics, in this task, the performance of OB1 also compares favourably with that of the reduced version of MUSST, which exploits a syntactic dependency parser. Two-tailed paired-sample t-tests, calculated by comparing per-sentence Levenshtein similarity between sets of simplified sentences, revealed that the observed differences in performance between OB1 and MUSST and between OB1 and Bsln are statistically significant for both the F1 and SARI metrics for texts of all registers (p ≪ 0.01). The only exception was the comparison of the SARI scores obtained by the Bsln and OB1 systems when processing texts of the literary register (p = 0.0604).
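The test statistic behind a paired-sample t-test can be sketched as follows. This is a minimal standard-library illustration under the assumption that each system has one score per input sentence, aligned by position; the function name is hypothetical. It returns only the t statistic and the degrees of freedom. Obtaining a two-tailed p-value additionally requires the Student t distribution, for which a library routine such as `scipy.stats.ttest_rel` could be used.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-sentence scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1
```

The pairing matters here: both systems are scored on the same input sentences, so the test operates on the per-sentence differences rather than on two independent samples.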
When transforming Type 1 sentences in the register of health, the F1-score of STARS, which exploits machine-learned rule activation patterns, is greater than that of OB1, which uses handcrafted rule activation patterns. Two-tailed paired-sample t-tests indicate that this difference is statistically significant (p = 0.01119). The reverse is true when simplifying sentences of the news register (p = 0.0004). There is no statistically significant difference in the accuracy of the two systems when simplifying Type 1 sentences in literary texts (p = 0.1739).
Table 6.5 presents the accuracy of the methods derived using the SARI and F1 metrics when simplifying Type 2 sentences. In this evaluation, the columns and rows of the table are similar to those of Table 6.3, though the evaluated simplification methods are those which use transformation schemes and rule activation patterns to detect and simplify complexRF NPs in input sentences. In the case of the MUSST system, the activated simplification rule was the one used to split sentences containing relative clauses, which is used to simplify Type 2 sentences.

Table 6.5: System performance when simplifying Type 2 sentences

Metric    Register    Bsln   MUSST  STARS  OB1    Orcl
SARI      Health      0.207  0.020  0.182  0.285  0.296
SARI      Literature  0.168  0.008  0.051  0.204  0.289
SARI      News        0.434  0.056  0.194  0.451  0.467
F1-score  Health      0.231  0.063  0.281  0.306  0.315
F1-score  Literature  0.572  0.000  0.248  0.516  0.791
F1-score  News        0.583  0.141  0.373  0.577  0.629

The SARI evaluation metric indicates few statistically significant differences in the accuracy of the OB1 and Bsln systems when simplifying Type 2 sentences (Table 6.5). A statistically significant difference in performance was only evident for sentences of the health register, where p = 0.036. By contrast, differences between the accuracy scores obtained by OB1 and MUSST are statistically significant, in favour of OB1, when simplifying Type 2 sentences in texts of all registers (p ≪ 0.01).
In terms of F1, when simplifying Type 2 sentences in texts of the registers of literature and news, the Bsln baseline is more accurate than my approach (OB1). The performance of OB1 was superior to that of Bsln when processing texts of the health register. Differences in the accuracy of the OB1 and Bsln systems are statistically significant for texts of the registers of health and literature (p < 0.0005). For the task of simplifying Type 2 sentences, performance of the OB1 system is far superior to that of the reduced version of MUSST. The system exploiting handcrafted rule activation patterns (OB1) is more accurate than the one exploiting machine-learned patterns (STARS) when simplifying Type 2 sentences in texts of all three registers. The differences are statistically significant (p ≪ 0.0001 in all cases).
When considered over all text registers, the difference in F1-scores obtained
by the OB1 and STARS systems is statistically significant when simplifying Type 1 sentences and Type 2 sentences. In the former case, the STARS system tends to be superior while in the latter, the OB1 system is superior.