
In the previous sections, we have outlined an approach to determine an antecedent’s compatibility with a pronoun’s context that relies directly on the co-occurrence counts of the respective nouns and verbs. Here, we will explore an approach that transforms word co-occurrences into a vector representation in the domain of vector space models. Cosine similarity between the word vectors that represent the antecedent candidate noun and the words in the pronoun’s context then serves as a means to determine the compatibility between the antecedent and the pronoun context.

6.3.1 word2vec as a framework to derive word embeddings

While there are numerous approaches to construct vector representations of words based on first-order co-occurrences, we make use of word2vec8 (Mikolov et al., 2013c,b,a), a state-of-the-art tool for this task. word2vec learns vector representations of words in a vector space model, much like latent semantic analysis (LSA) (Landauer et al., 1998) and other related approaches. In related vector space models, word vector representations are obtained by constructing vectors with n dimensions, where the dimensions correspond, for example, to the n most frequent nouns in a corpus. The co-occurrence counts of a given word with these n most frequent words then constitute its vector representation. The cosine similarity between these vector representations can then be used to estimate the similarity of words. In LSA, singular value decomposition factorizes the co-occurrence matrix (constructed using the words in the vocabulary as rows and the n most frequent words as columns) to obtain compressed representations with fewer dimensions. Three smaller matrices are constructed, where one represents a clustering of the rows and one a clustering of the columns. The third matrix describes how these compressed matrices can be combined to retrieve an approximation of the original matrix.
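The count-based representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the setup used in this work: the toy corpus, the window size, and the function names are assumptions made for the example, and the SVD step of LSA is left out.

```python
import math
from collections import Counter

def most_frequent(sentences, n):
    """Return the n most frequent words; they serve as vector dimensions."""
    counts = Counter(word for sent in sentences for word in sent)
    return [word for word, _ in counts.most_common(n)]

def cooccurrence_vectors(sentences, context_words, window=2):
    """Map each word to its co-occurrence counts with the context words."""
    index = {c: i for i, c in enumerate(context_words)}
    vectors = {}
    for sent in sentences:
        for pos, word in enumerate(sent):
            vec = vectors.setdefault(word, [0] * len(context_words))
            lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
            for j in range(lo, hi):
                if j != pos and sent[j] in index:
                    vec[index[sent[j]]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy, made-up corpus of lemmatized sentences (illustration only).
toy_corpus = [
    ["cook", "prepare", "roast"],
    ["chef", "prepare", "roast"],
    ["dog", "chase", "cat"],
]
context_words = most_frequent(toy_corpus, 3)
vectors = cooccurrence_vectors(toy_corpus, context_words)
similarity = cosine(vectors["cook"], vectors["chef"])
```

In this toy setting, "cook" and "chef" share the same contexts and therefore receive similar vectors, while "dog" shares none of the chosen context dimensions with them.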

8 To learn the vectors, we use word2vec itself (https://code.google.com/p/word2vec/). To integrate the vectors into our Python code, we use gensim (https://radimrehurek.com/gensim/models/word2vec.html).

Chapter 6. Semantics for pronoun resolution 149

The main difference between word2vec (and related neural word embeddings) and traditional models, such as LSA, is that the meaning of the vector dimensions is latent to begin with. Unlike in LSA, the dimensions cannot be thought of as clusterings of rows and columns of a co-occurrence matrix. The dimensions and their values are mathematical artifacts of the optimization process during training. Instead of taking the n most frequent words, word2vec takes n latent dimensions and learns values for them in a deep learning/neural network-inspired fashion. However, compared to related work, the approach removes the need for hidden layers, which makes it fast to train. The basic idea is to learn vector representations of words based on positive and negative examples of context words. Co-occurrence contexts of words are mined from large corpora. Positive examples of context words for a given word w are drawn from a window whose size is defined as one of the model parameters. Negative examples are drawn from outside the co-occurrence window of word w. Given the positive and negative evidence, the objective is then to learn a set of parameters so that the vector denoting the word w, i.e. $\vec{w}$, has a high similarity with the vectors $\vec{c}$ of the context words within the window, and a low similarity with the vectors of context words outside the window.9 The word embeddings produced by word2vec out of the box have been shown to outperform several traditional word vector models on a variety of tasks (Baroni et al., 2014, inter alia).
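The objective sketched above can be illustrated for a single training example. The following is a simplified rendering of the negative-sampling loss, following the general formulation discussed by Goldberg and Levy (2014). It is not the word2vec implementation: the function names are assumptions for this example, and the choice of negative examples as well as the gradient updates are left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(w, positive, negatives):
    """Negative log-likelihood for one word vector w, one observed
    context vector, and a list of negative context vectors.

    The loss is small when w is similar to the observed context
    and dissimilar to the negative contexts.
    """
    loss = -math.log(sigmoid(dot(w, positive)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(w, neg)))
    return loss

# Hypothetical two-dimensional vectors, chosen only for illustration.
w = [1.0, 0.0]
well_separated = sgns_loss(w, positive=[2.0, 0.0], negatives=[[-2.0, 0.0]])
badly_separated = sgns_loss(w, positive=[-2.0, 0.0], negatives=[[2.0, 0.0]])
```

Training would adjust the word and context vectors to reduce this loss over all examples mined from the corpus, which is why `well_separated` comes out lower than `badly_separated`.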

6.3.2 Application of word embeddings to pronoun resolution

For our purpose of assessing the compatibility of a given antecedent candidate $a_i$ and a pronoun’s context, we use the vector representations of the antecedent head word, i.e. $\vec{a}_i$, and the (relevant) words in the pronoun’s context and calculate the average cosine-based similarity $\mathrm{cos\_sim}(\cdot, \cdot)$ of the antecedent word vector and the word vectors in the pronoun’s context, i.e. the vector of the verb governing the pronoun, $\vec{v}_j$, and the vector of its additional argument, $\vec{arg}_k$:

$$\mathrm{comp}(a_i, v_j, arg_k) = \frac{1}{2}\left(\mathrm{cos\_sim}(\vec{a}_i, \vec{v}_j) + \mathrm{cos\_sim}(\vec{a}_i, \vec{arg}_k)\right)$$
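As a minimal sketch, the compatibility score can be computed as follows. Vectors are represented as plain lists of floats, and the fallback to the verb similarity alone when no additional argument is available is an assumption of this example, not something specified above.

```python
import math

def cos_sim(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def comp(a_i, v_j, arg_k=None):
    """Average cosine similarity of the antecedent vector a_i with the
    verb vector v_j and the vector of the additional argument arg_k.

    Assumption: if there is no additional argument (arg_k is None),
    only the similarity with the verb is used.
    """
    sims = [cos_sim(a_i, v_j)]
    if arg_k is not None:
        sims.append(cos_sim(a_i, arg_k))
    return sum(sims) / len(sims)
```

For example, an antecedent vector that is identical in direction to the verb vector but orthogonal to the argument vector receives a score of 0.5.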

However, as stated above, our interest is not to model general word similarity, the typical test task for word embeddings. Our aim is to determine the compatibility of a given antecedent candidate with a given pronoun context. That is, we aim to specifically model the compatibility of an antecedent candidate’s head noun and the position of the pronoun in its context. Therefore, we must take into account the grammatical role of the pronoun in its context.

9 This describes the workings of the skipgram approach using negative sampling on a very basic conceptual level. Discussing the learning algorithm in detail is beyond the scope here. We refer to the original papers cited above and recommend Goldberg and Levy (2014) for an approachable mathematical exposition. Furthermore, Levy and Goldberg (2014b) showed that word2vec seems to factorize a PMI matrix internally. Also, Levy et al. (2015) have shown that, given the right parameter settings, traditional approaches to word vector estimation, such as PMI and singular value decomposition of the PMI matrices, perform on par with word2vec. However, word2vec achieves state-of-the-art results right out of the box and does not need extensive parameter tuning.

For example, consider a compatibility model that simply takes into account the selectional preferences of the verb governing a pronoun to determine compatibility with the given antecedent candidates. Let us assume two test instances:

(13) Er bereitet den Braten zu.
     He prepares the roast.

(14) Der Koch bereitet ihn zu.
     The cook prepares him* (it).

We want to resolve each pronoun in turn and we have the antecedent candidate Koch for both examples. In the first example (13), Koch is a very likely antecedent for Er in the subject argument slot of the verb zubereiten, because cooks normally prepare food. Thus, we can assume that the vector-based similarity of Koch and zubereiten would be high and we would select Koch as the antecedent for Er.

For the second example (14), however, where we want to resolve ihn, which is the direct object of the verb zubereiten, we would perform exactly the same query as in the first example given our word vector representation. That is, we would look up the similarity of the Koch vector and the zubereiten vector, which would be the same as in the first example. We would thus assume that we can select Koch as the antecedent of ihn. However, Koch is a very unlikely direct object of zubereiten.

That is, we cannot straightforwardly apply the word2vec approach to assess the compatibility of antecedent candidates with a pronoun’s context, since the model lacks any notion of grammatical functions.10 To alleviate this, we perform a simple transformation of the input that is fed into word2vec. We concatenate the words (i.e. their lemmas) in a sentence with their grammatical functions, as in the following example:

(15) Original sentence: Der Koch bereitet den Braten zu.
     Input sentence: Der_det Koch_subj zubereiten_root Braten_obja
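The transformation in (15) can be sketched as a small preprocessing function. The input format (a list of lemma and grammatical function pairs, as a dependency parser might provide them) and the underscore separator are assumptions of this example, and the pairs below simply mirror the tokens shown in (15).

```python
def tag_with_functions(parsed_sentence, separator="_"):
    """Concatenate each lemma with its grammatical function.

    parsed_sentence: list of (lemma, function) pairs; this input format
    is hypothetical and stands in for the output of a dependency parser.
    """
    return [f"{lemma}{separator}{function}"
            for lemma, function in parsed_sentence]

# Lemma/function pairs corresponding to the input sentence in (15).
parsed = [
    ("Der", "det"),
    ("Koch", "subj"),
    ("zubereiten", "root"),
    ("Braten", "obja"),
]
tokens = tag_with_functions(parsed)
```

Each resulting token list would then form one training sentence, so that a lemma occurring as subject and the same lemma occurring as direct object are treated as distinct vocabulary items.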

10 Note that there are approaches that incorporate syntax into word2vec, notably Levy and Goldberg (2014a), or into vector representations in general, e.g. Rothenhäusler and Schütze (2009). However, these approaches make use of syntax to identify relevant context words, i.e. syntactic co-arguments for target words, but do not derive vectors for individual grammatical functions that a word occurs with.


Now, we will learn separate vector representations for the subject (e.g. “Koch_subj”) and direct object (“Koch_obja”) instantiations of words, which will enable us to more specifically determine the compatibility of antecedent candidates and pronoun contexts. In our examples above (13, 14), a vector representation that is ignorant of grammatical functions would yield maximal similarity for the antecedent candidate “Koch”, since the antecedent word itself occurs in the pronoun context as an additional verb argument. Clearly, this high similarity would boost the “Koch” candidate as antecedent. Selecting it would, however, produce a nonsensical sentence, i.e. “Der Koch bereitet den Koch zu.” Including the grammatical roles in the vector representations lowers this similarity from 1 (for “Koch” and “Koch”) to 0.55 (for “Koch_obja” and “Koch_subj”).
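To illustrate how role-specific vectors could be queried, the following sketch ranks antecedent candidates for a pronoun by looking up each candidate under the grammatical function of the pronoun. The lookup table `vectors` (a plain dictionary from role-tagged tokens to vectors), the key format, and the decision to skip candidates without a vector are assumptions of this example, not a description of the actual system.

```python
import math

def cos_sim(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(candidates, pronoun_function, context_keys, vectors):
    """Rank candidate lemmas by average similarity to the pronoun context.

    candidates: antecedent head lemmas, e.g. ["Koch", "Braten"]
    pronoun_function: grammatical function of the pronoun, e.g. "obja"
    context_keys: role-tagged context tokens, e.g. ["zubereiten_root"]
    vectors: hypothetical dict mapping role-tagged tokens to vectors
    """
    context = [vectors[k] for k in context_keys if k in vectors]
    ranked = []
    for lemma in candidates:
        key = f"{lemma}_{pronoun_function}"
        if key not in vectors or not context:
            continue  # assumption: candidates without a vector are skipped
        sims = [cos_sim(vectors[key], c) for c in context]
        ranked.append((lemma, sum(sims) / len(sims)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

For the pronoun ihn in (14), the candidates would be looked up as direct objects (e.g. “Koch_obja”) and compared with “zubereiten_root” and “Koch_subj”, rather than with function-agnostic vectors.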

In comparison to the graph-based approach, the word2vec model has the advantage that it is able to calculate the similarity between any two words in its vocabulary, since it does not rely on first- and second-order co-occurrences directly. Thus, we expect the word2vec model to have broader coverage and applicability. On the other hand, the graph-based model more directly implements the notion of compatibility of verbs and their arguments based on co-occurrence, because it explicitly models the co-occurrences in designated syntactic contexts. Therefore, we expect the graph-based approach to provide high precision.