

In this section we present the experimental results on the tasks described in Section 5.3, comparing the different verb representations.

Verb Similarity The correlation results on the verb similarity tasks are displayed in Table 5.4.

The first two columns report Spearman's ρ for the original skipgram verb vectors and for the vectors trained on subjects and objects as context, using the similarity metrics of Table 5.2. For the matrix and cube representations, we report the best scores of the subject-verb or verb-object matrices, using the matrix similarity metric from Table 5.2 and the parameterised middle/late fusion, and refer the reader to Appendix B for the full results. Finally, for the cubes the reported score is the highest obtained across the different verb argument clustering configurations, again with the full results tables in Appendix B.
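The evaluation protocol behind these correlation scores can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the toy vectors, the gold ratings and the use of cosine similarity are assumptions made for the example, whereas the actual similarity metrics are those of Table 5.2.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(xs), average_ranks(ys))[0, 1])

# Hypothetical verb vectors and human ratings, for illustration only.
vectors = {
    "run": np.array([1.0, 0.2, 0.0]),
    "walk": np.array([0.9, 0.3, 0.1]),
    "eat": np.array([0.0, 1.0, 0.5]),
    "drink": np.array([0.1, 0.9, 0.6]),
}
pairs = [("run", "walk", 8.0), ("eat", "drink", 7.5),
         ("run", "eat", 1.0), ("walk", "drink", 2.0)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
gold_scores = [gold for _, _, gold in pairs]
rho = spearman_rho(model_scores, gold_scores)
print(f"Spearman rho = {rho:.3f}")
```

Since only the ranking of the pairs matters, any monotone rescaling of the model scores leaves ρ unchanged.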

            v_a      v_{s/o/b}   V_a (matrix)   V_{s/o}   V_a (cube)
MEN_v       0.282    0.248       0.500          0.589      0.035
SimLex_v    0.046    0.272       0.163          0.340      0.024
VerbSim     0.338    0.563       0.085          0.550     -0.076
SimVerb_d   0.224    0.249      -0.023          0.291     -0.012
SimVerb_t   0.183    0.197       0.019          0.240     -0.025

TABLE 5.4: Spearman ρ correlation on verb similarity datasets. The subscript v indicates that we are looking at the partial verb-only dataset. For SimVerb we distinguish between the development set SimVerb_d and the test set SimVerb_t. We compare standard skipgram vectors v_a with specific context vectors v_{s/o/b}, matrices with full sentence context V_a, our model that predicts one dependency argument V_{s/o}, and cubes with a full sentence context V_a.

For the case of verb vectors, the general skipgram model is outperformed by the vectors trained on the verb arguments as context, and in fact these show the highest performance on the VerbSim dataset. That the matrix and cube representations with the full sentence as context perform rather poorly, and in many cases worse than the vector representations, illustrates that the choice of context is too general for these higher-order representations. On four out of the five tasks, however, our proposed method of training matrices with a restricted notion of context outperforms all other models, most notably on the 3000-entry test subset of the SimVerb dataset, where the score increases from 0.183 for the standard skipgram vectors to 0.240.

5 For ELMo, we used Google's module at https://tfhub.dev/google/elmo/2; for BERT we used the python bert_embedding package from https://github.com/imgarylai/bert-embedding.

Neural Tensor Clustering Table 5.5 shows the correlation scores on the verbs of the compositional tasks discussed in the previous section. In this experiment, we perform the sentence disambiguation and similarity tasks using only the verbs of the sentences. In each case, we first apply the verb tensors to their subject and object clusters, use middle or late fusion where appropriate, compute their degrees of similarity, and use this degree for disambiguation or a straight similarity calculation. The implementation configurations are the same as in the verb similarity tasks. We observe the same pattern in the results: retraining of the verb vectors slightly improves the performance. This contrasts with the erratic performance of the full context matrix representations and the very poor performance of the cube representations (on all but the ML2008 dataset). Again, our proposed matrix representations with a restricted context significantly outperform the other methods. Note, however, that these results are again the highest of a parameter sweep over the cluster and fusion settings of the representations, with full results in Appendix B.

            v_a      v_{s/o/b}   V_a (matrix)   V_{s/o}   V_a (cube)
ML2008      0.067    0.055       0.161          0.124      0.178
ML2010_v    0.396    0.528      -0.004          0.638      0.003
GS2011      0.226    0.331       0.369          0.399     -0.028
KS2013a     0.184    0.100       0.062          0.218      0.003
KS2013b     0.445    0.638      -0.055          0.695     -0.025
ELLDIS      0.341    0.389       0.386          0.516      0.047
ELLSIM      0.370    0.577       0.022          0.643      0.011

TABLE 5.5: Spearman ρ correlation for verbs of compositional tasks. Each score is a maximum score out of possible clusters and fusion weights. We again compare standard skipgram vectors v_a with specific context vectors v_{s/o/b}, matrices with full sentence context V_a, our model that predicts one dependency argument V_{s/o}, and cubes with a full sentence context V_a.

Compositional Models The most interesting results come from the compositional tasks. These compose a representation for each sentence of the dataset by taking into account the representations of all of the words within that sentence, rather than by only working with individual word representations, as was done in the previous two tasks. The results in Table 5.6 show three baseline models on the left, and the three tensor skipgram representations that we trained on the right. First, there is the arithmetic baseline C(+, ⊙), where we compose either by adding or point-wise multiplying all the vectors in a sentence. Then, there are two non-neural baseline models for which we compose by using either the Kronecker verb matrix, C(V_Kron), or the relational matrix, C(V_Rel), in one of the composition models de-

there is the matrix that is trained by predicting a full sentence context after transforming either the verb's subject or object (C(V_a)), whereas our proposed representation is the verb matrix that transforms one of its arguments (subject/object) and predicts the other argument as context (C(V_{s/o})). Finally, we compare to the cube model, in which a cube is trained by transforming both subject and object vectors, and predicting the remainder of the sentence as context (C(V_a)).
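The non-neural baselines can be sketched in a few lines. The constructions below follow the definitions that are common in the literature (the Kronecker matrix as the outer product of the verb vector with itself, the relational matrix as a sum of outer products of observed subject and object vectors); they are stated here as assumptions rather than as the exact configuration used in our experiments. The function `copy_subject` is one hypothetical choice of composition model, included only to make the example complete.

```python
import numpy as np

def additive(vectors):
    """Arithmetic baseline: sum all word vectors of a sentence."""
    return np.sum(vectors, axis=0)

def multiplicative(vectors):
    """Arithmetic baseline: point-wise product of all word vectors."""
    return np.prod(vectors, axis=0)

def kronecker_matrix(verb_vec):
    """Assumed Kronecker verb matrix: outer product of the verb vector with itself."""
    return np.outer(verb_vec, verb_vec)

def relational_matrix(subj_obj_pairs):
    """Assumed relational verb matrix: sum of subject-object outer products
    over the (subject, object) pairs observed with the verb."""
    return sum(np.outer(s, o) for s, o in subj_obj_pairs)

def copy_subject(verb_matrix, subj, obj):
    """Hypothetical composition model: subj * (V @ obj), point-wise."""
    return subj * (verb_matrix @ obj)

# Toy two-dimensional vectors, for illustration only.
subj = np.array([1.0, 2.0])
verb = np.array([0.5, 1.0])
obj = np.array([3.0, 4.0])

print(additive([subj, verb, obj]))
print(multiplicative([subj, verb, obj]))
print(copy_subject(kronecker_matrix(verb), subj, obj))
print(copy_subject(relational_matrix([(subj, obj)]), subj, obj))
```

The relational matrix is built from the arguments a verb actually occurs with, which is why it encodes subject and object information directly.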

            Baseline                          Neural
            C(+, ⊙)   C(V_Kron)   C(V_Rel)   C(V_a) (matrix)   C(V_{s/o})   C(V_a) (cube)
ML2008      0.171     0.082       0.192      -0.045             0.188        —
ML2010_v    0.541     0.402       0.511       0.000             0.550        —
GS2011      0.187     0.205       0.323       0.247             0.536       -0.021
KS2013      0.181     0.281       0.188       0.203             0.372       -0.043
KS2013b     0.672     0.530       0.511       0.542             0.753        0.064
ELLDIS      0.308     0.304       0.368       0.221             0.559        0.030
ELLSIM      0.671     0.522       0.646       0.532             0.759        0.093

TABLE 5.6: Spearman ρ scores on compositional tasks. C(+, ⊙) denotes arithmetic models, whereas the other columns represent the best score for a compositional model with the different verb representations (C(V_Kron): Kronecker matrix, C(V_Rel): Relational matrix, C(V_a): Skipgram matrix with sentence context, C(V_{s/o}): Skipgram matrix with argument as context, C(V_a): Skipgram cube with sentence context).

The results table shows that the neurally trained verb matrices with full sentences as context do not significantly improve performance compared to the non-neural compositional baselines, and in the case of disambiguation they are inferior. This shows that the choice of context matters a lot: here, the full sentence is taken as a context, but this is not discriminative enough to achieve high correlation, whereas for instance the Relational matrix directly encodes subject and object information, allowing it to be more robust on the compositional tasks.

Similarly to the verb similarity results, the cubes show a very poor performance, which we argue is due to data sparsity. Even though the cubes implicitly model properties of arguments of the verbs, their representation is too sparse to effectively model anything. Moreover, as with the verb matrices V_a, they consider the full sentence as context since there is no other straightforward way of defining this. Our proposed matrix model remedies both the sparsity problem and the choice of context, and outperforms all the other representations, save on the ML2008 dataset.

Sentence Encoders and Contextualised Representations We compare the results of our proposed neural tensor embeddings with sentence encoder models in Table 5.7, and with the ELMo and BERT embeddings in Table 5.8.

While we see a pattern similar to our study in Chapter 4 for the sentence encoders, namely that the tensor-based models work well on disambiguation tasks but alternative sentence encoding methods work better on similarity tasks, we find that our embeddings generally outperform the contextualised encodings (ELMo, BERT) across tasks. Although it is still an open question to what extent such language models are able to encode syntactic information, they definitely do not encode dependency information explicitly as in our proposal, which could explain the strong performance of our representations on the evaluation tasks that tend to contain relatively short sentences with a focus on syntactic awareness. We see this reflected in the fact that the contextualised embeddings of ELMo perform best on the ELLSIM dataset, granted that the ellipsis is resolved first.

               C(V_{s/o})   D2V1    D2V2    ST       IS1     IS2     IS3     IS4      USE
ML2008         0.188        0.139   0.192   0.078    0.181   0.220   0.149   0.169    0.039
ML2010         0.550        0.512   0.447   0.494    0.631   0.492   0.636   0.405    0.325
GS2011         0.536        0.098   0.102  -0.157    0.297   0.320   0.324   0.213    0.094
KS2013         0.372        0.193   0.212   0.051    0.172   0.032   0.176  -0.021    0.210
KS2014         0.753        0.692   0.705   0.546    0.784   0.676   0.720   0.586    0.539
MLELLDIS_lin   —            0.090   0.233   0.232    0.199   0.224   0.108   0.135    0.105
MLELLDIS_res   0.221        0.089   0.216   0.167    0.228   0.269   0.144   0.156    0.154
MLELLDIS_abl   —            0.095   0.242   0.159    0.215   0.226   0.141   0.169    0.109
ELLDIS_lin     —            0.199   0.227  -0.193    0.347   0.384   0.330   0.344    0.269
ELLDIS_res     0.559        0.231   0.253  -0.172    0.344   0.337   0.293   0.248    0.277
ELLDIS_abl     —            0.195   0.259  -0.130    0.353   0.357   0.300   0.291    0.240
ELLSIM_lin     —            0.593   0.622   0.585    0.779   0.701   0.748   0.641    0.647
ELLSIM_res     0.760        0.698   0.692   0.604    0.803   0.749   0.768   0.687    0.680
ELLSIM_abl     —            0.652   0.655   0.471    0.782   0.730   0.749   0.682    0.640

TABLE 5.7: Spearman ρ scores on compositional tasks, with state of the art sentence encoders. D2V1: Doc2Vec 1, D2V2: Doc2Vec 2, ST: Skip-Thought, IS1: InferSent 1 (4096), IS2: InferSent 2 (4096), IS3: InferSent 1 (300), IS4: InferSent 2 (300), USE: Universal Sentence Encoder.

The influence of the fusion parameter One interesting aspect of our proposed matrix model is that two separate matrices are trained, each of which optimises the prediction of one of the verb's dependency arguments (subject/object), given the other argument. When composing a sentence embedding, we then have a choice of setting the parameter α to fuse together the matrices, or their respective compositions. To see the effect of this parameter, we look at the influence of the value of α (0 for the pure subject-verb matrix, 1 for the pure verb-object matrix, and a weighted sum of both in between) on the performance on the compositional tasks. For each dataset, we show the average effect across all tested composition models, showing this effect both for middle and late fusion. Table 5.9 displays the effect of the α parameter on the intransitive sentence datasets ML2008 and ML2010.
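The difference between the two fusion strategies can be made concrete with a small sketch. We assume here that fusion is a weighted sum with weight α, as described above; the composition function `compose` is a hypothetical stand-in (a length-normalised point-wise product) and not one of the composition models used in the experiments.

```python
import numpy as np

def middle_fusion(V_subj, V_obj, alpha, compose):
    """Fuse the two verb matrices first, then compose once."""
    fused = (1.0 - alpha) * V_subj + alpha * V_obj
    return compose(fused)

def late_fusion(V_subj, V_obj, alpha, compose):
    """Compose with each matrix separately, then fuse the two results."""
    return (1.0 - alpha) * compose(V_subj) + alpha * compose(V_obj)

def make_compose(subj, obj):
    """Hypothetical composition: subj * (V @ obj), scaled to unit length."""
    def compose(V):
        vec = subj * (V @ obj)
        return vec / np.linalg.norm(vec)
    return compose

rng = np.random.default_rng(0)
dim = 4
V_subj = rng.normal(size=(dim, dim))   # stands in for the subject-verb matrix
V_obj = rng.normal(size=(dim, dim))    # stands in for the verb-object matrix
compose = make_compose(rng.normal(size=dim), rng.normal(size=dim))

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    mid = middle_fusion(V_subj, V_obj, alpha, compose)
    late = late_fusion(V_subj, V_obj, alpha, compose)
    print(alpha, np.linalg.norm(mid - late))
```

At α = 0 and α = 1 both strategies reduce to composing with a single matrix. For intermediate values they can only differ when the composition is not linear in the verb matrix, which is why the sketch normalises the composed vector.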

In the case of ML2008 there is a preference for the subject-verb matrix (objects as contexts), and it is the other way around for ML2010. Performance goes down once the matrices are mixed in middle fusion (the light green line), whereas late fusion generally boosts performance (the dark green line), illustrating the importance of the choice between middle and late fusion.

Table 5.10 shows the effect of α on all other datasets. As a general pattern, on these datasets the choice of middle versus late fusion matters (with late fusion generally being the better choice), though the effect of α is the same regardless of the fusion type, except on the GS2011 dataset. What is most notable is that for both GS2011 and ELLDIS, which contain the same verbs, there is a preference for the subject-verb matrix, with the peaks of the graphs at values of α lower than 0.5, whereas the verb-object matrix is more important in the other datasets. The explicit results for all values of α and per composition operator are listed in Appendix B.

               C(V_{s/o})   ELMo    BERT Small   BERT Large
ML2008         0.188        0.166   0.105        0.030
ML2010_v       0.550        0.539   0.216        0.356
GS2011         0.536        0.108   0.187        0.292
KS2013         0.372        0.243   0.232        0.349
KS2014         0.753        0.728   0.520        0.616
MLELLDIS_lin   —            0.193   0.373        0.315
MLELLDIS_res   0.221        0.182   0.103        0.342
MLELLDIS_abl   —            0.123   0.089        0.193
ELLDIS_lin     —            0.232   0.360        0.368
ELLDIS_res     0.559        0.210   0.216        0.274
ELLDIS_abl     —            0.207   0.197        0.365
ELLSIM_lin     —            0.734   0.595        0.580
ELLSIM_res     0.759        0.779   0.631        0.647
ELLSIM_abl     —            0.703   0.560        0.582

TABLE 5.8: Spearman ρ scores on compositional tasks, with state of the art contextualised embeddings. For BERT, we use the small and large versions; for the small version we use both uncased and cased book corpus.

TABLE 5.9: The effect of the α fusion parameter on middle and late fusion for intransitive sentence datasets.

TABLE 5.10: The effect of the α fusion parameter on middle and late fusion for datasets GS2011, KS2013, KS2014, MLELLDIS, ELLDIS, and ELLSIM.

5.5 Conclusion

Type-driven compositional distributional semantics has shown that the symbolic formal semantic structure of a sentence can be transformed into a vectorial form, by representing the words therein as tensors whose ranks depend on their grammatical roles: nouns are represented as vectors, adjectives as matrices, transitive verbs as cubes and so on. Tensors are multilinear maps and thus the type-driven distributional models offer a canonical form of composing them: via tensor contraction. The tensors, however, are high dimensional, and it is not clear how they should be learned.
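As a small illustration of composition by tensor contraction, the sketch below contracts an adjective matrix with a noun vector and a transitive verb cube with its subject and object vectors. The dimensionality, the random values and the ordering of the cube's axes are assumptions made for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # toy dimensionality; noun and sentence space are assumed to coincide

noun = rng.normal(size=d)               # noun: vector
adjective = rng.normal(size=(d, d))     # adjective: matrix
subject = rng.normal(size=d)
obj = rng.normal(size=d)
verb = rng.normal(size=(d, d, d))       # transitive verb: cube, axes (subject, sentence, object)

# Adjective-noun phrase: contract the matrix with the noun vector.
adj_noun = np.einsum("ij,j->i", adjective, noun)

# Transitive sentence: contract the cube with subject and object vectors.
sentence = np.einsum("i,ijk,k->j", subject, verb, obj)

print(adj_noun.shape, sentence.shape)
```

The cube has d³ entries per verb, which makes the dimensionality problem mentioned above tangible even for modest d.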

In this final contribution chapter of this thesis, we generalised the widely used skipgram model to learn neural tensor embeddings for words with any number of dependencies. The role of a word tensor is to transform the embeddings of its dependencies and train a skipgram objective to predict context vectors for the results of these transformations. The notion of context can vary here: we worked with full sentence contexts as well as restricted versions thereof, where a tensor is applied to some of its dependencies and the results are used to predict the rest. Our model reduces to the original noun skipgram model when no dependencies are involved, and covers the adjective-noun skipgram model of Maillard and Clark [MC15], where there is only one dependency.

We implemented our model on transitive verbs, learning cubes for them in a full sentence context, and matrix and vector approximations in a restricted subject and object context. In the approximated cases, the learned verb-subject matrix is applied to the subject vectors to predict the object context vectors, and the verb-object matrix is applied to the object vectors to predict the subject context vectors.
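To illustrate the restricted-context case, the sketch below performs one gradient step for a single verb-subject matrix that is applied to a subject vector and trained to predict an object context vector. The negative-sampling form of the loss, the learning rate and all names are assumptions made for this illustration; the sketch is not the training code used for the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_and_grad(V, arg, pos_ctx, neg_ctxs):
    """Negative-sampling skipgram loss for one (argument, context) pair,
    with the verb matrix V applied to the argument vector first.
    Returns the loss and its gradient with respect to V."""
    h = V @ arg                                # transformed argument
    pos = sigmoid(pos_ctx @ h)                 # score of the observed context
    neg = sigmoid(neg_ctxs @ h)                # scores of the sampled contexts
    loss = -np.log(pos) - np.sum(np.log(1.0 - neg))
    grad_h = -(1.0 - pos) * pos_ctx + neg @ neg_ctxs
    return loss, np.outer(grad_h, arg)         # chain rule through h = V @ arg

rng = np.random.default_rng(0)
dim, num_neg = 4, 3
V_sv = rng.normal(scale=0.1, size=(dim, dim))          # verb-subject matrix
subj = rng.normal(scale=0.5, size=dim)                 # subject vector
obj_ctx = rng.normal(scale=0.5, size=dim)              # observed object context
neg_ctxs = rng.normal(scale=0.5, size=(num_neg, dim))  # sampled negative contexts

loss_before, grad = loss_and_grad(V_sv, subj, obj_ctx, neg_ctxs)
V_sv_updated = V_sv - 0.01 * grad                      # one plain SGD step
loss_after, _ = loss_and_grad(V_sv_updated, subj, obj_ctx, neg_ctxs)
print(loss_before, loss_after)
```

The verb-object matrix would be trained symmetrically, with the object vector as the argument and subject context vectors as prediction targets.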

We experimented on word similarity, sentence similarity and verb disambiguation tasks. For verb similarity, we considered the verb-only fragments of MEN and SimLex-999, the 130 element VerbSim dataset [YP06], and the SimVerb-3500 dataset of Gerz et al. [Ger+16]. Our neural matrix embeddings provided the best results on four of the five verb similarity datasets, beating the full rank tensors as well as the baseline vectors, both with the general and restricted contexts. We further tested our models on the intransitive sentence similarity dataset of Mitchell and Lapata [ML10] and its transitive extension [KSP13], as well as on the intransitive sentence disambiguation dataset of Mitchell and Lapata [ML08] and its transitive extensions [GS11a; KS13]. The results have the same pattern: the neural matrix approximations of verbs outperform their neural cube representations, their non-neural matrix representations, the additive and multiplicative models, and the verb-only vector and tensor baselines. We inspected the effect of fusion on task performance, and saw that some tasks benefitted more from a focus on the objects-as-contexts verb matrices, whereas other tasks preferred the subjects-as-contexts verb matrices.

Given the full generality of the model and the promising initial experimental results, the tensor skipgram model paves the way for a new generation of type-driven distributional semantic models. In the disambiguation and similarity tasks that we evaluated on in this study, we found that the best models of verbs were those that used a specific context, the subject or object, which came from a dependency parsed corpus. Although this is not a pure tensor-based model as it uses two matrix approximations of the verb, this model strikes a balance between feasibility of training on the one hand and specificity of the encoded information on the other hand. In future work, we aim to expand this model in two ways: first, we wish to investigate more of the properties of the representations to gain a better insight into what is really encoded in the representations themselves, as opposed to the more task-centric view we held here. Second, we would like to develop these representations for any word in a dependency parsed sentence, which would then allow us to evaluate not just on focussed tasks but also on more general natural language understanding tasks, such as natural language inference or question answering.

Part IV

Chapter 6

Conclusion & Future Work

This thesis presented the result of a three years' investigation into compositional distributional models. We summarise here the main contributions of the thesis and end with some directions for future endeavours.

6.1 Summary

Having given the general background of word embeddings and their compositionality in Chapter 1, in Chapter 2 we delved deeper into a particular, type-driven approach to composition, which started off with the work of Coecke, Sadrzadeh, and Clark [CSC10] and Coecke, Grefenstette, and Sadrzadeh [CGS13], describing in the language of category theory how to interpret the grammatical structure of a sentence as a multi-linear transformation applied to the embeddings of the individual words in a sentence.

The models developed along these lines have been experimented with extensively [GS15; KSP13; Mil+14], but all assume that there is a one-to-one relation between the text of a sentence and its meaning. Thus, the challenge we addressed in this thesis is that of finding a compositional distributional model that is robust against cases in which the meaning of a sentence is not explicitly given by its surface form; the test case for this was ellipsis with anaphora. To solve this challenge, we developed the theory for a compositional vector space model of ellipsis and anaphora, relying on a unimodal extension of the Lambek Calculus to provide a grammatical model for ellipsis and anaphora, in Chapter 3. This model could then deal with the recovering of implicit semantic content (as one finds in examples of ellipsis), by means of a limited form of contraction in the grammar logic. We discussed how this model uses different structural rules to accommodate different linguistic phenomena, and as such can be ported to deal with pronoun relativisation as well. That the theory does not always suit the implementation was shown by the fact that a model that directly maps types to vector spaces and proofs to (multi-)linear maps will give unwanted predictions in the presence of ambiguous elliptical phrases: different interpretations of an ambiguous sentence were shown to coincide in meaning. To amend this, we relaxed the model to a setting in which a non-linear term calculus interprets our grammar logic with controlled contraction. We then showed how the structural ambiguity puzzle can be solved on the level of semantics.

In order to give experimental support for the models that deal with ellipsis and anaphora, we introduced in Chapter 4 three new datasets that allow one to contrast concrete distributional models that do not resolve ellipsis, i.e. the what-you-see-is-what-you-get approach, with models that perform linguistic analysis to give the intended meaning of a sentence.

Using these new tasks, we showed that indeed resolving verb phrase ellipsis gives a positive boost to the correlation of a model with human judgments. Moreover, the experiments showed that state of the art neural sentence embeddings are not always the optimal choice when assessing sentence comprehension.

Finally, in Chapter 5 we addressed the issue of lexical semantics in a tensor-based model: although some methods have been around to concretely derive the content of word tensors, most of these approaches either suffer from data sparsity issues or from overparameterised training models. To amend this, we formulated a generalisation of the well-known skipgram model [Mik+13] to describe a class of models that may be implemented for any word of any grammatical type. We instantiated this model on the case of transitive verbs, and evaluated on all the compositional distributional tasks that we had considered so far. The results indicated that our matrix decomposition model, which trains two separate matrices per verb, outperformed previous analytical approaches to verb representation on all datasets but ML2008, and mostly outperformed neural sentence encoders and contextualised embeddings.