We report results on the publicly available FCE dataset (see Chapter 2, Section 2.1.1.1). Following Briscoe et al. (2010), we parse the training and test data using the Robust Accurate Statistical Parsing (RASP) system (Briscoe et al., 2006) with the standard tokenisation and sentence boundary detection modules, in order to broaden the space of candidate features suitable for the task. RASP, an open-source system, includes an unlexicalised parser, which is expected to perform well in the noisy domain of learner text, where misspellings and grammatical errors are common, though this is evaluated only implicitly, through the usefulness of the features extracted from the parser’s analyses. As in Briscoe et al. (2010), our focus is on developing an accurate AA system for ESOL text that does not require prompt-specific or topic-specific training. Although the FCE corpus of manually-marked texts was produced by learners of English in response to prompts eliciting free-text answers, the marking criteria are primarily based on the accurate use of a range of different linguistic constructions. For this reason, it is plausible to assume that an approach which directly measures linguistic competence will be better suited to ESOL text assessment, and will have the additional advantage that it may not require re-training or tuning for new prompts or assessment tasks.
We extract the features used by Briscoe et al. (2010) and replicate their model. These features are mainly motivated by the assumption that lexical and grammatical properties should be highly discriminative for automatically assessing linguistic competence in learner writing. Their full feature set is as follows:
1. Lexical ngrams
   (a) Word unigrams
   (b) Word bigrams
2. Part-of-speech (POS) ngrams
   (a) POS unigrams
   (b) POS bigrams
   (c) POS trigrams
3. Features representing syntax
   (a) Phrase structure (PS) rules
4. Other features
   (a) Script length
   (b) Error rate
Word unigrams and bigrams are lower-cased and used in their inflected forms. POS unigrams, bigrams and trigrams are extracted using the RASP tagger, which uses the CLAWS tagset. The most probable posterior tag per word is used to construct POS ngram features; however, given the large number of misspellings in learner data, we use the RASP parser’s option to analyse words assigned multiple tags when the posterior probability of the highest ranked tag is less than 0.9, and the next n tags have probability greater than 1/50 of it.
Based on the most likely parse for each identified sentence, the rule names from the phrase structure (PS) tree are extracted. RASP’s rule names are semi-automatically generated and encode detailed information about the grammatical constructions found (e.g., ‘V1/modal bse/+-’, a VP consisting of a modal auxiliary head followed by an (optional) adverbial phrase, followed by a VP headed by a verb with base inflection). Moreover, rule names explicitly represent information about peripheral or rare constructions (e.g., ‘S/pp-ap s-r’, an S with preposed PP with adjectival complement, as in "for better or worse, he left"), as well as about fragmentary and likely extra-grammatical sequences (e.g., ‘T/txt-frag’, a text unit consisting of two or more subanalyses that cannot be combined using any rule in the grammar). Therefore, many (longer-distance) grammatical constructions and errors found in texts can be (implicitly) captured by this feature type.
Although FCE contains information about the linguistic errors committed (see Chapter 2, Section 2.1.1.1), Briscoe et al. (2010) try to estimate an error rate in a way that does not require manually tagged data. They build a trigram language model (LM) using ukWaC (Ferraresi et al., 2008), a large corpus of English containing more than 2 billion tokens (ukWaC LM). A word trigram is counted as an error if it is not found in the language model. They compute presence/absence efficiently using a Bloom filter encoding of the language model (Bloom, 1970). However, they also use an error rate calculated from the FCE error tags to obtain an upper bound for the performance of an automated error estimator (true FCE error rate).
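The presence/absence test described above can be sketched as follows. This is a minimal illustration and not the implementation of Briscoe et al. (2010): the class `BloomFilter`, the helper names, the filter size, the number of hash functions, and the choice to normalise the error count by the number of trigrams in the script are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, List, Tuple

Trigram = Tuple[str, ...]


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        # Sizes are illustrative assumptions, not values from the source.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterable[int]:
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def trigrams(tokens: List[str]) -> List[Trigram]:
    """Return all contiguous word trigrams of a token sequence."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def error_rate(tokens: List[str], lm: BloomFilter) -> float:
    """Fraction of trigrams not found in the LM (normalisation is an assumption)."""
    grams = trigrams(tokens)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if " ".join(g) not in lm)
    return unseen / len(grams)
```

Because a Bloom filter can return false positives, an estimate of this kind can only undercount unseen trigrams, never overcount them.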
Feature instances of types 1 and 2 are weighted using tf*idf and their vectors are normalised by the L2 norm, that is, the square root of the sum of squares. Feature type 3 is weighted using frequency counts, while types 3 and 4 are scaled so that their final values have approximately the same order of magnitude as those of types 1 and 2. The script length is based on the number of words and is mainly added to balance the effect that the length of a script has on other features. Finally, features whose overall frequency is lower than four are discarded from the model.
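As an illustration of the weighting scheme for feature types 1 and 2, the sketch below applies a frequency cut-off, tf*idf weighting and L2 normalisation to bags of ngram features. The exact tf*idf variant is not specified in the text, so raw term frequency multiplied by log(N/df) is assumed here; the function name `tfidf_l2` and its parameters are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_l2(docs: List[List[str]], min_count: int = 4) -> List[Dict[str, float]]:
    """Weight features by tf*idf and scale each vector to unit L2 norm.

    Assumes tf = raw count and idf = log(N / df); features whose overall
    frequency is below `min_count` are discarded before weighting.
    """
    total = Counter(f for doc in docs for f in doc)
    kept = {f for f, c in total.items() if c >= min_count}

    # Document frequency: number of scripts in which each kept feature occurs.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc) & kept)

    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(f for f in doc if f in kept)
        vec = {f: count * math.log(n_docs / df[f]) for f, count in tf.items()}
        # L2 norm: square root of the sum of squared weights.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {f: w / norm for f, w in vec.items()}
        vectors.append(vec)
    return vectors
```

With this idf variant, a feature that occurs in every script receives zero weight, which is one reason other variants (for example, with smoothing) are often preferred in practice.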
In extending Briscoe et al.’s AA model, we hypothesise that features capturing the syntactic complexity of sentences should also be indicative of a learner’s writing competence. More specifically, we investigate the impact of complexity measures representing the distance between a head and a dependent (in word tokens) in a grammatical relation (GR). GRs represent syntactic dependencies between constituents in a clause, and are automatically identified by RASP. An example is illustrated in Figure 3.1 using an FCE excerpt, which shows the different types of relations between words represented as lemmas and POS tags.1 For example, ‘ncsubj’ represents binary relations between non-clausal subjects (NPs, PPs) and their verbal heads, as in have VH0 you PPY. The distance in word tokens between have VH0 and you PPY is 1, while the distance between If CS and have VH0 is 2. The direction of the relation, or equivalently, the position of the head compared to the dependent, distinguishes positive dependencies from negative ones. For example, have VH0 and you PPY have a positive dependency (the dependent precedes the head), while have VH0 and any DD have a negative one (the head precedes the dependent) (for more details see Briscoe, 2006).

1 The dependency graph was produced using the SemGraph tool:

Figure 3.1: Example GR types and dependencies.
Lemmas: If you have any more question write I a short letter .
POS tags: CS PPY VH0 DD DAR NN2 VV0 PPIO1 AT1 JJ NN1 .
GR labels on the arcs: cmod, ncsubj, obj2, dobj, det, ncmod, ncmod, ccomp, ncsubj, dobj
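The distance and direction conventions described above can be made concrete with a small sketch. The `GR` tuple and the zero-based token indices are assumptions made for illustration and do not reflect RASP’s actual output format; the relation label of the second pair is left unspecified because the text does not name it.

```python
from typing import NamedTuple


class GR(NamedTuple):
    """A grammatical relation between two token positions (hypothetical format)."""
    rel: str
    head: int  # zero-based token index of the head
    dep: int   # zero-based token index of the dependent


def distance(gr: GR) -> int:
    """Distance between head and dependent, in word tokens."""
    return abs(gr.head - gr.dep)


def is_positive(gr: GR) -> bool:
    """Positive if the dependent precedes the head, negative otherwise."""
    return gr.head > gr.dep


# Token positions in the excerpt: If=0, you=1, have=2, any=3, ...
ncsubj = GR("ncsubj", head=2, dep=1)          # have VH0 <- you PPY
have_any = GR("(unspecified)", head=2, dep=3)  # have VH0 -> any DD

print(distance(ncsubj), is_positive(ncsubj))      # distance 1, positive
print(distance(have_any), is_positive(have_any))  # distance 1, negative
```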
We extract a number of complexity measures representing GR distance in various ways from RASP and explore their impact on performance. In particular, we experiment with 24 different numerical features, grouped for positive and negative dependencies and presented below:
1. GR-LONGEST-TOP-P/N: longest distance in word tokens between a head and dependent in a grammatical relation (GR) over the top ranked derivation for positive and negative dependencies (P/N) separately.
2. GR-TOTAL-TOP-P/N: sum of the distances between a head and dependent over the top ranked derivation for P/N dependencies separately.
3. GR-MEAN-TOP-P/N: the means for P/N dependencies calculated by dividing GR-TOTAL-TOP-P/N by the number of GRs in the set for the top parse only.
4. GR-LONGEST-NBEST-P/N: longest distance for P/N over the top 100 parses.2
5. GR-TOTAL-NBEST-P/N: sum of the distances for all GR sets over the top 100 parses for P/N separately.
6. GR-MEAN-NBEST-P/N: the means for P/N dependencies calculated by dividing GR-TOTAL-NBEST-P/N by the number of GRs in the top 100 parses.
7. NBEST-MED-GR-TOTAL-P/N: median for GR-TOTAL-NBEST-P/N (calculated over the top 100 parses).
8. NBEST-STD-GR-TOTAL-P/N: standard deviation for GR-TOTAL-NBEST-P/N.
9. NBEST-AVG-GR-TOTAL-P/N: average for GR-TOTAL-NBEST-P/N.
10. NBEST-MED-GR-LONGEST-P/N: median for GR-LONGEST-NBEST-P/N.
11. NBEST-STD-GR-LONGEST-P/N: standard deviation for GR-LONGEST-NBEST-P/N.
12. NBEST-AVG-GR-LONGEST-P/N: average for GR-LONGEST-NBEST-P/N.
2 We chose the top 100 parses mostly for convenience, as various statistics are easily available from
Intuitively, these complexity measures capture aspects of the grammatical sophistication of the writer through the representation of the distance between heads and dependents in various forms (e.g., longest or mean distance), and we hypothesise that they can be used to assess linguistic competence. However, they may also be confounded in cases where sentence boundaries are not identified, for example, due to poor punctuation. In the experiments presented here, we evaluate the performance of individual measures as well as their combinations for the AA task. We identify a set of discriminative complexity measures and use their values as features in the document vectors. Although these features bear some similarities to the method of Lonsdale and Strong-Krause (2003), who roughly measure the complexity and deviation of a sentence from the parser’s grammatical model in order to assign a score to a text, this is the first study on the application of these complexity features to learner language assessment and their evaluation under a data-driven methodology.
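One possible reading of the twelve GR-distance measures listed earlier is sketched below for a single sentence. Several details are assumptions, since the text does not fix them: each parse is represented as a list of (head index, dependent index) pairs, the medians, standard deviations and averages are taken over per-parse values in the n-best list, the standard deviation is the population one, the number of GRs counts only relations of the given polarity, and empty sets contribute zero. The function name `gr_measures` is hypothetical.

```python
import statistics
from typing import Dict, List, Tuple

Parse = List[Tuple[int, int]]  # (head_index, dependent_index) for each GR


def _distances(parse: Parse, positive: bool) -> List[int]:
    # Keep only GRs of the requested polarity (positive: dependent precedes head).
    return [abs(h - d) for h, d in parse if h != d and (h > d) == positive]


def gr_measures(nbest: List[Parse], positive: bool) -> Dict[str, float]:
    """Compute the twelve measures for one polarity from a ranked n-best list."""
    if not nbest:
        raise ValueError("nbest must contain at least the top-ranked parse")
    suffix = "P" if positive else "N"
    per_parse = [_distances(p, positive) for p in nbest]
    top = per_parse[0]
    longest = [max(d, default=0) for d in per_parse]  # longest distance per parse
    totals = [sum(d) for d in per_parse]              # summed distance per parse
    n_grs = sum(len(d) for d in per_parse)
    measures = {
        "GR-LONGEST-TOP": max(top, default=0),
        "GR-TOTAL-TOP": sum(top),
        "GR-MEAN-TOP": sum(top) / len(top) if top else 0.0,
        "GR-LONGEST-NBEST": max(longest),
        "GR-TOTAL-NBEST": sum(totals),
        "GR-MEAN-NBEST": sum(totals) / n_grs if n_grs else 0.0,
        "NBEST-MED-GR-TOTAL": statistics.median(totals),
        "NBEST-STD-GR-TOTAL": statistics.pstdev(totals),
        "NBEST-AVG-GR-TOTAL": statistics.mean(totals),
        "NBEST-MED-GR-LONGEST": statistics.median(longest),
        "NBEST-STD-GR-LONGEST": statistics.pstdev(longest),
        "NBEST-AVG-GR-LONGEST": statistics.mean(longest),
    }
    return {f"{name}-{suffix}": float(value) for name, value in measures.items()}
```

Calling the function once with `positive=True` and once with `positive=False` yields the 24 numerical features; how they are aggregated from sentences to a whole script is a further design choice not covered by this sketch.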
Next, in order to get a better estimate of the error rate, we extend the ukWaC language model with trigrams extracted from FCE texts (ukWaC+FCE LM). As FCE contains texts produced by second language learners, we only extract frequently occurring trigrams from highly ranked scripts, to avoid introducing erroneous ones into our language model. We hypothesise that adapting the LM to the FCE vocabulary will further improve the performance of the AA system, as it will allow us to calculate an error rate that directly captures (correct) learner word-usage patterns.
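The selection step can be sketched as below. The score and frequency thresholds are not given in the text, so `min_score` and `min_count` are hypothetical parameters, as is the representation of a script as a (tokens, score) pair; the returned trigrams would then be added to the existing language model.

```python
from collections import Counter
from typing import List, Set, Tuple

Script = Tuple[List[str], float]  # (tokens, assigned score); hypothetical format


def frequent_trigrams(scripts: List[Script], min_score: float,
                      min_count: int) -> Set[Tuple[str, ...]]:
    """Collect trigrams that occur often in highly ranked scripts only."""
    counts: Counter = Counter()
    for tokens, score in scripts:
        if score >= min_score:  # ignore low-scoring scripts
            counts.update(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))
    # Keep only trigrams seen at least `min_count` times across those scripts.
    return {gram for gram, count in counts.items() if count >= min_count}
```

Restricting the counts to high-scoring scripts and applying a frequency threshold are two independent filters; either one alone would admit more learner errors into the model.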