This module focuses on identifying the entities by extracting their mentions in the text. An entity (or existent in Chatman’s taxonomy [72]) is something, physical or abstract, that exists as a particular unit. Entities in a narrative include physical existents such as characters, locations or props, and abstract existents such as events or time expressions. Each entity may be realized in a text using different referring expressions. Each instance of these referring expressions is commonly known as a mention. For example, in Figure 4.5, “the boy” is a mention or referring expression to Vasili. Vasili is the canonical referring expression of an entity that has been previously introduced and happens to be a character and the hero of that story. Throughout this chapter we will use the term mention to refer to each of the specific mentions of an entity. When we later talk about characters, we will be referring to the entities.

In order to extract mentions, we developed an algorithm that uses syntactic parse trees (as generated by the Stanford CoreNLP). Our algorithm traverses each of the sentence parse trees looking for a “noun phrase” (NP) node. For each NP node, our algorithm does the following: if the subtree rooted at the current NP node contains nested clauses (such as verb phrases or prepositional phrases) or the leaves of the subtree contain an enumeration (a conjunction or a list separator token), then our algorithm traverses its associated subtree recursively. Otherwise, if any leaf in the subtree is a noun, personal pronoun or possessive pronoun, the node is marked as a mention, and its subtree is not explored any further. Using this process with the input sentence “The captain of the ship saw the boy” (illustrated in Figure 4.5), our algorithm detects three individual mentions (shaded in purple, pink and orange). After finding the compound “The captain of the ship,” the algorithm recursively detects two nested mentions, “the captain” and “the ship”.

Figure 4.5: Syntactic parse of a sentence annotated by Voz after the mention extraction process. Mentions or referring expressions are colored. Note the compound mention (shaded in blue), which is traversed recursively to find mentions of 3 entities (shaded in purple, pink and orange).

Figure 4.6: Syntactic parse of a sentence annotated by Voz after the mention extraction process. Lists are identified (shaded in purple and light orange) and references to individual mentions extracted.

A special case is considered for enumerations: the deepest node containing an enumeration token (indicated by a comma or the “and” or “or” conjunctions) is marked as a list. Lists can be used later to match plural pronouns with the mentions in the list. Figure 4.6 shows the list “a man and a woman” shaded in purple. In a later stage, the coreferenced pronoun “their” in the following phrase (also shaded in purple for illustrative purposes) can be used to obtain the individual “man” and “woman” mentions.
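The traversal and the list rule described above can be sketched in a few lines. This is a minimal illustration, not the Voz implementation: the `Node` class, the bracketed-string reader `parse`, the label sets `NESTED_LABELS` and `ENUM_TAGS`, and the function names are all assumptions made for the example. It also assumes Penn Treebank style labels and treats "the deepest node containing an enumeration token" as the NP whose direct child is that token.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

NESTED_LABELS = {"VP", "PP", "SBAR"}   # assumed labels for "nested clauses"
ENUM_TAGS = {"CC", ","}                # conjunctions and list separators


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None         # set only on POS-tag (leaf) nodes


def parse(text: str) -> Node:
    """Read a bracketed parse such as "(NP (DT the) (NN boy))"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        node = Node(tokens[pos + 1])   # tokens[pos] is "("
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.word = tokens[pos]
                pos += 1
        pos += 1                       # consume ")"
        return node

    return read()


def leaves(node: Node) -> List[Node]:
    if node.word is not None:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def descendants(node: Node):
    for child in node.children:
        yield child
        yield from descendants(child)


def text_of(node: Node) -> str:
    return " ".join(leaf.word for leaf in leaves(node))


def is_nounish(tag: str) -> bool:
    # Nouns (NN*), personal pronouns (PRP) and possessive pronouns (PRP$).
    return tag.startswith("NN") or tag in {"PRP", "PRP$"}


def extract(node: Node, mentions: List[str], lists: List[str]) -> None:
    if node.label == "NP":
        nested = any(d.label in NESTED_LABELS for d in descendants(node))
        enumerated = any(leaf.label in ENUM_TAGS for leaf in leaves(node))
        if nested or enumerated:
            # Simplification: an NP that directly holds the token is a list.
            if any(child.label in ENUM_TAGS for child in node.children):
                lists.append(text_of(node))
            for child in node.children:
                extract(child, mentions, lists)
            return
        if any(is_nounish(leaf.label) for leaf in leaves(node)):
            mentions.append(text_of(node))   # do not explore further
            return
    for child in node.children:
        extract(child, mentions, lists)


def find_mentions(tree_text: str) -> Tuple[List[str], List[str]]:
    mentions: List[str] = []
    lists: List[str] = []
    extract(parse(tree_text), mentions, lists)
    return mentions, lists


captain = ("(S (NP (NP (DT The) (NN captain)) (PP (IN of) (NP (DT the) "
           "(NN ship)))) (VP (VBD saw) (NP (DT the) (NN boy))))")
couple = "(NP (NP (DT a) (NN man)) (CC and) (NP (DT a) (NN woman)))"

print(find_mentions(captain))  # (['The captain', 'the ship', 'the boy'], [])
print(find_mentions(couple))   # (['a man', 'a woman'], ['a man and a woman'])
```

The hand-written parses stand in for the parser output; in practice the trees would come from the syntactic parser rather than from literal strings.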

Experimental Results

In order to evaluate this module, we used a dataset of 21 stories from our corpus of Russian and Slavic folk tales. Please refer to Chapter 3 for more information on our dataset. To reduce preprocessing issues at the discourse level, we manually removed quoted and direct speech (i.e., dialogues and passages where the narrator addressed the reader directly). The edited input dataset contains 914 sentences. The stories range from 14 to 69 sentences (µ = 43.52 sentences, σ = 14.47). There are a total of 18126 tokens (words and punctuation; µ = 19.83 words per sentence, σ = 15.40). To evaluate the task of mention extraction, we annotated 4280 noun phrases (NP) representing referring expressions.

Our algorithm identifies 4791 individual mentions: all 4280 of the annotated noun phrases plus 511 that are not actual referring expressions but parsing errors, mostly adjectival phrases identified as nominal phrases by the off-the-shelf NLP tools used. For example, in the sentence “And indeed she was warmer.”, “warmer” was wrongly identified as a noun phrase. Our method has a recall of 100% (all of the annotated mentions were found) but a precision of 89.3% (F-measure = 0.944).
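The reported figures can be reproduced from the counts given above. The short check below assumes that the reported F value is the balanced F-measure (the harmonic mean of precision and recall); the variable names are illustrative.

```python
annotated = 4280        # annotated referring expressions (from the text)
extracted = 4791        # mentions returned by the algorithm (from the text)
false_positives = 511   # extracted mentions that are parsing errors

true_positives = extracted - false_positives
precision = true_positives / extracted
recall = true_positives / annotated
# Assumption: balanced F-measure (F1).
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
# precision=0.893 recall=1.000 F=0.944
```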

4.3 A Machine Learning Approach to Identifying Characters from Extracted Mentions

Once we have identified the mentions of entities in the text, an important task for our approach to modeling a narrative is to identify the characters that participate in a story. This task is counterintuitively difficult in the domain of fictional stories for a number of reasons, mainly: 1) the use of proper names for characters specific to Slavic folklore (e.g., Morozko or Baba Yaga), 2) the presence of other fantastical creatures not commonly found in other domains (e.g., dragons or goblins), and 3) the presence of animals and even anthropomorphic objects that play a character and fulfill a narrative role (e.g., a talking mouse or a magic oven). In the second and third cases, the problem is magnified by the fact that these may be referred to with the pronoun “it”, which may confuse them with other animals or props. Related work in this area has acknowledged the difficulty of this task for domains like literary fiction [13] and movie plot summaries [56].

extracted mentions. We solve this classification problem using a machine learning approach inspired by case-based reasoning. Specifically, we try to classify the extracted mentions as characters (animated sentient beings in the story) and non-characters (remaining existents and happenings defined in Chatman’s taxonomy [72] such as locations, animals, props or happenings).

Our character identification method uses case-based reasoning (CBR) [156], a family of algorithms that reuse past solutions to solve new problems. The previously solved problems, called cases, are stored in a case base. In our approach, each case is an extracted mention, represented as a feature vector, annotated as either character or non-character by a human. Given a problem or a query, case-based reasoning uses a retrieval step similar to k-NN [157] where instances from the case base are selected based on a similarity metric. Then a solution to the problem or query at hand is adapted from the retrieved cases. In terms of a classification problem, the case base contains labeled examples and, when retrieved, the labels in the examples are used to predict the label of the query. We address the following problem: given a mention extracted from an unannotated story in natural language, determine whether it is a character or not.
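The retrieve-and-reuse loop described above can be illustrated with a small k-nearest-neighbor classifier. This is only a sketch: the three features, the toy case base, the choice of k = 3, majority voting, and the Euclidean distance are assumptions made for the example and stand in for the feature set and similarity measure actually used by the method.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Case = Tuple[Sequence[float], str]   # (feature vector, human-assigned label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    # Placeholder distance; any other measure can be passed to classify().
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(query: Sequence[float], case_base: List[Case], k: int = 3,
             distance: Callable = euclidean) -> str:
    # Retrieval: select the k cases closest to the query.
    nearest = sorted(case_base, key=lambda case: distance(query, case[0]))[:k]
    # Reuse: predict the majority label among the retrieved cases.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (subject of a verb, gendered pronoun, location noun).
case_base: List[Case] = [
    ((1.0, 1.0, 0.0), "character"),
    ((0.8, 0.0, 0.0), "character"),
    ((0.9, 0.6, 0.1), "character"),
    ((0.0, 0.0, 1.0), "non-character"),
    ((0.1, 0.0, 0.9), "non-character"),
]

print(classify((0.9, 0.8, 0.0), case_base))  # character
print(classify((0.0, 0.0, 0.8), case_base))  # non-character
```

Keeping the distance as a parameter makes it straightforward to swap in a different similarity measure without touching the retrieval code.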

In this work we present two contributions: 1) a set of features used to compute a feature vector describing each mention, and 2) a novel similarity measure used to determine the most similar cases for retrieval, which we call the weighted continuous Jaccard distance [14].
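The measure itself is not defined in this passage, so the following is only one plausible reading of its name, stated as an assumption: the set-based Jaccard index is generalized to real-valued features in [0, 1] by replacing intersection and union with per-feature minimum and maximum, and each feature is scaled by a weight. The function name and the example weights are hypothetical; the cited work should be consulted for the actual definition.

```python
from typing import Sequence


def weighted_continuous_jaccard_distance(a: Sequence[float],
                                         b: Sequence[float],
                                         weights: Sequence[float]) -> float:
    # Assumed form: 1 - sum(w * min(a, b)) / sum(w * max(a, b)).
    overlap = sum(w * min(x, y) for w, x, y in zip(weights, a, b))
    total = sum(w * max(x, y) for w, x, y in zip(weights, a, b))
    if total == 0:
        return 0.0   # two all-zero vectors are treated as identical
    return 1.0 - overlap / total


# Identical vectors are at distance 0, disjoint ones at distance 1.
print(weighted_continuous_jaccard_distance((1.0, 0.0), (1.0, 0.0), (1, 1)))
print(weighted_continuous_jaccard_distance((1.0, 0.0), (0.0, 1.0), (1, 1)))
```

Under this reading, a larger weight makes disagreement on that feature count more toward the distance, which is one way a learned or hand-set feature weighting could influence retrieval.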

4.3.1 Verb Extraction

This module runs in parallel to the previous modules and is tasked with extracting verbs and their arguments from the text. These verbs will be used to identify interactions and relationships between the previously extracted mentions.

To identify the verbs in a sentence, Voz uses the typed dependencies from the Stanford CoreNLP output. Specifically, Voz looks at dependencies of type “nominal subject” or “passive nominal subject” where the head word is POS-tagged as a verb (this excludes linking verbs). In the case of “nominal subject”, the dependent of the typed dependency is considered the subject of the verb. All of the remaining dependencies of the verb are explored and if a mention is found in any of the dependencies (i.e., direct object, indirect object and prepositional objects) it is extracted and tagged as one of