
6.2.1 Definitions

Distributional approaches to detecting cross-lingual semantic word similarity from non-parallel data are all based on an idea known as the distributional hypothesis [119], which states that words with similar meanings are likely to appear in similar contexts. Each word is typically represented by a high-dimensional vector, called a context vector, in a feature vector space or so-called semantic space, where the dimensions of the vector are its context features. The semantic similarity of two words, $w_1^S$ given in the source language $L^S$ with vocabulary $V^S$ and $w_2^T$ in the target language $L^T$ with vocabulary $V^T$, is then:

$$\mathrm{sim}(w_1^S, w_2^T) = SF\big(\mathrm{vec}(w_1^S), \mathrm{vec}(w_2^T)\big) \tag{6.1}$$

where $\mathrm{vec}(w_1^S)$ is an $N$-dimensional context vector with $N$ context features $c_n$:

$$\mathrm{vec}(w_1^S) = [sc_1^S(c_1), \ldots, sc_1^S(c_n), \ldots, sc_1^S(c_N)] \tag{6.2}$$

$sc_1^S(c_n)$ denotes a co-occurrence score for $w_1^S$ associated with context feature $c_n$ (similarly for $w_2^T$). $SF$ is a similarity function operating on the context vectors.$^1$

Modeling Co-Occurrence: Weighting Functions. As mentioned, given a word $w_1^S \in V^S$, $sc_1^S(c_n)$ assigns a co-occurrence score of the word $w_1^S$ with some context feature $c_n$. Distributional models differ in the way each $c_n$ is weighted, that is, the way the co-occurrence of $w_1^S$ and its context feature $c_n$ is mapped to the score $sc_1^S(c_n)$. There exists a variety of options for weighting: the values of $sc_1^S(c_n)$ are typically raw co-occurrence counts $C(w_1^S, c_n)$, conditional feature probability scores $P(c_n \mid w_1^S)$, weighting heuristics such as term frequency-inverse document frequency (TF-IDF), point-wise mutual information (PMI), or association scores based on hypothesis testing such as the log-likelihood ratio (LLR).
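As an illustration of how the same co-occurrence events can be mapped to different scores, the following minimal Python sketch computes three of the weighting options listed above: raw counts $C(w, c_n)$, conditional feature probabilities $P(c_n \mid w)$, and PMI. The toy co-occurrence events and all function names are hypothetical and serve only as an example; they are not taken from the thesis.

```python
import math
from collections import Counter

# Hypothetical toy data: each pair is one observed (word, context feature) event.
pairs = [
    ("dog", "bark"), ("dog", "bark"), ("dog", "pet"),
    ("cat", "meow"), ("cat", "pet"), ("cat", "pet"),
]

pair_counts = Counter(pairs)                 # C(w, c_n)
word_counts = Counter(w for w, _ in pairs)   # C(w)
feat_counts = Counter(c for _, c in pairs)   # C(c_n)
total = len(pairs)


def raw_count(w, c):
    """Raw co-occurrence count C(w, c_n)."""
    return pair_counts[(w, c)]


def cond_prob(w, c):
    """Conditional feature probability P(c_n | w) as a relative frequency."""
    return pair_counts[(w, c)] / word_counts[w]


def pmi(w, c):
    """Point-wise mutual information log2(P(w, c_n) / (P(w) P(c_n)))."""
    joint = pair_counts[(w, c)] / total
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    p_w = word_counts[w] / total
    p_c = feat_counts[c] / total
    return math.log2(joint / (p_w * p_c))


def context_vector(w, features, score):
    """Build vec(w) over a fixed feature list with a chosen weighting function."""
    return [score(w, c) for c in features]


features = ["bark", "meow", "pet"]
print(context_vector("dog", features, raw_count))
print(context_vector("dog", features, cond_prob))
```

Passing the weighting function as an argument keeps the vector construction identical across weighting schemes, which mirrors the way the text treats the score as a pluggable component.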

Obtaining Similarity Scores: Similarity Functions. Once two words are represented as $N$-dimensional vectors in the same feature space (see eq. (6.2)), it is possible to measure their similarity in that feature space by means of a similarity function. There is a plethora of different similarity functions organized

$^1$ The reader has to be aware that the work presented in part III tackles the more difficult cross-lingual setting. We present a unified general probabilistic framework which does not change its modeling premises regardless of the actual setting (monolingual vs. cross-lingual or multilingual). All proposed models are fully functional in the monolingual setting. Monolingual models of similarity are special cases of cross-lingual models and are subsumed by this framework.

92 MODELING SEMANTIC SIMILARITY BASED ON LATENT CROSS-LINGUAL TOPICS

in different families according to [43]: (1) the inner product family of SF-s, such as the cosine similarity used in [93, 160] or the Jaccard index [243, 123], (2) the Minkowski family, with SF-s such as the Euclidean distance or the city-block metric as used in [247], (3) the fidelity family, with SF-s such as the Bhattacharyya coefficient [144], (4) the Shannon entropy family, with SF-s such as the Kullback-Leibler divergence [312] or the Jensen-Shannon divergence [236], (5) the graph-based family, with SF-s such as SimRank [164], and (6) the family of SF-s tailored specifically for measuring semantic similarity, such as the Lin measure [177]. For an overview of these similarity functions and further options, we refer the interested reader to the survey papers [167, 43].
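To make the families more concrete, here is a minimal Python sketch of one or two representatives from several of them, operating on plain lists of scores. The function names are illustrative assumptions. Note that the cosine and the Bhattacharyya coefficient grow with similarity, whereas the Minkowski distances and the divergences shrink, so the latter would need to be negated or inverted before being used as SF for ranking.

```python
import math


def cosine(u, v):
    """Inner product family: cosine of the angle between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def minkowski(u, v, p=2):
    """Minkowski family: p=2 is the Euclidean distance, p=1 the city-block metric."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


def bhattacharyya(p, q):
    """Fidelity family: Bhattacharyya coefficient of two probability vectors."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; assumes q > 0 wherever p > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture m."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(cosine([1, 0], [1, 0]))
print(minkowski([0, 0], [3, 4]))
print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

The probabilistic functions assume that the input vectors are normalized to sum to one, for example when the scores are conditional feature probabilities.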

Output of Models of Semantic Similarity: Ranked Lists. After applying a similarity function, for each source word $w_1^S$, we can build a ranked list $RL(w_1^S)$. The ranked list consists of all words $w_j^T \in V^T$ ranked according to their respective similarity scores $\mathrm{sim}(w_1^S, w_j^T)$. In a similar fashion, we can build a ranked list $RL(w_2^T)$ for each target word $w_2^T$. We call the top $M$ best scoring target words $w_j^T$ for some source word $w_1^S$ its $M$ nearest neighbors. The ranked list for $w_1^S$ comprising only its $M$ nearest neighbors is called a pruned ranked list (i.e., the ranked list is effectively pruned at position $M$), and we denote it as $RL_M(w_1^S)$. The single nearest cross-lingual neighbor for $w_1^S$ is called its translation candidate, and in case that is a word $w_2^T$, we write $TC(w_1^S) = w_2^T$.

One may construct a one-to-one bilingual lexicon from the output ranked lists of semantically similar words by simply harvesting all translation candidates, that is, by retaining all cross-lingual pairs $(w_1^S, TC(w_1^S))$. The pair $(w_1^S, TC(w_1^S))$ is referred to as a translation pair or a bilingual lexicon entry.
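The construction of ranked lists, pruned ranked lists, translation candidates and the resulting one-to-one lexicon can be sketched in a few lines of Python. This is only a minimal illustration: the word vectors are made-up toy values that are assumed to live in a shared feature space already, the cosine is just one possible choice of SF, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as an example similarity function SF."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranked_list(src_word, src_vecs, tgt_vecs, sf=cosine):
    """RL(w): all target words sorted by decreasing similarity to src_word."""
    scored = [(t, sf(src_vecs[src_word], vec)) for t, vec in tgt_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def pruned_ranked_list(src_word, src_vecs, tgt_vecs, m, sf=cosine):
    """RL_M(w): the ranked list pruned at position M (the M nearest neighbors)."""
    return ranked_list(src_word, src_vecs, tgt_vecs, sf)[:m]


def translation_candidate(src_word, src_vecs, tgt_vecs, sf=cosine):
    """TC(w): the single nearest cross-lingual neighbor."""
    return ranked_list(src_word, src_vecs, tgt_vecs, sf)[0][0]


def build_lexicon(src_vecs, tgt_vecs, sf=cosine):
    """One-to-one lexicon: every source word paired with its translation candidate."""
    return {s: translation_candidate(s, src_vecs, tgt_vecs, sf) for s in src_vecs}


# Hypothetical toy vectors over three shared context features.
src_vecs = {"dog": [2, 0, 1], "cat": [0, 2, 1]}
tgt_vecs = {"perro": [3, 0, 1], "gato": [0, 3, 1], "casa": [1, 1, 0]}

lexicon = build_lexicon(src_vecs, tgt_vecs)
print(pruned_ranked_list("dog", src_vecs, tgt_vecs, m=2))
print(lexicon)
```

Since the lexicon keeps exactly one target word per source word, nothing prevents two source words from sharing the same translation candidate; the sketch makes no attempt to resolve such collisions.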

6.2.2 Related Work (Shared Cross-Lingual Features)

In order to compute cross-lingual semantic word similarity, one needs to design context features for words given in two different languages such that these features span a shared cross-lingual semantic space or a shared cross-lingual vector space. This means that words need to have the same representations over the same set of features irrespective of their actual language. Context vectors $\mathrm{vec}(w_1^S)$ and $\mathrm{vec}(w_2^T)$ for both source and target words are then compared in the shared semantic space independently of their respective languages. Such cross-lingual semantic spaces are typically spanned by:

(1) Entries from an external bilingual lexicon which is hand-crafted or extracted from a parallel corpus [246, 93, 247, 73, 92, 102, 213, 99, 160, 269, 6, 173, 174, 283]. These approaches presuppose the existence of an expensive external resource in the form of a bilingual lexicon or parallel data, which is a rather heavy

CROSS-LINGUAL SEMANTIC SIMILARITY: AN OVERVIEW OF DISTRIBUTIONAL MODELS 93

assumption for many language pairs and domains for which such high-quality resources do not exist.

(2) Predefined explicit cross-lingual categories obtained from a knowledge base or an ontology [69, 94, 49, 121, 2, 122, 193]. The typical features are Wikipedia categories, Wikipedia anchors or categories from EuroWordNet [310]. A problem with these approaches again lies in the fact that it is extremely time-consuming and expensive to build such knowledge bases and ontologies for different languages. In other words, they again presuppose the existence of high-quality external resources, which effectively limits their portability to other language pairs and domains. Moreover, it is especially challenging to realize such explicit structures cross-lingually and to define shared cross-lingual categories.

(3) Latent language-independent semantic concepts/axes (e.g., latent cross-lingual topics) induced by an algebraic model [81, 159], or more recently by a generative probabilistic model [117, 64, 312]. These approaches are fully data-driven as they utilize only internal evidence from a given corpus. However, all previous approaches still rely on language-pair-specific knowledge such as orthographic clues [155, 117, 33, 64] or again require an initial bilingual lexicon [117, 33] in modeling.
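As a concrete illustration of what a shared cross-lingual feature space means, the following minimal Python sketch follows the simplest option from the list above, item (1): each entry of a seed bilingual lexicon defines one shared dimension, and a word's score on that dimension is its co-occurrence count with the entry's word in its own language. The seed lexicon, the toy corpora, the window size and the function name are all hypothetical assumptions made for the example.

```python
from collections import Counter

# Hypothetical seed lexicon: entry i defines shared dimension i.
seed_lexicon = [("water", "agua"), ("food", "comida"), ("house", "casa")]

# Hypothetical tokenized toy corpora (not parallel, only comparable in content).
src_corpus = [["dog", "drinks", "water"], ["dog", "eats", "food"], ["dog", "in", "house"]]
tgt_corpus = [["perro", "bebe", "agua"], ["perro", "come", "comida"], ["gato", "bebe", "agua"]]


def context_vector(word, sentences, feature_words, window=2):
    """Count co-occurrences of `word` with each feature word within a token window."""
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[sent[j]] += 1
    # Only the seed-lexicon words become dimensions of the shared space.
    return [counts[f] for f in feature_words]


src_feats = [s for s, _ in seed_lexicon]  # source-language side of each dimension
tgt_feats = [t for _, t in seed_lexicon]  # target-language side of each dimension

vec_dog = context_vector("dog", src_corpus, src_feats)
vec_perro = context_vector("perro", tgt_corpus, tgt_feats)
vec_gato = context_vector("gato", tgt_corpus, tgt_feats)
print(vec_dog, vec_perro, vec_gato)
```

Because position i of every vector refers to the same lexicon entry regardless of language, the source and target vectors are directly comparable with any similarity function. With latent cross-lingual topics, as in item (3), the dimensions would instead be the induced topics.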

In this part of the thesis we are interested in the models of similarity from item (3). In other words, we are interested in a specific type of context features, that is, latent cross-lingual semantic topics/concepts. In summary, we explore the models of cross-lingual semantic similarity and build a new statistical framework in a particularly difficult (but extremely cheap) minimalist setting which builds only on co-occurrence counts and latent cross-lingual semantic topics/concepts induced directly from comparable corpora, and which does not rely on any other resource (e.g., machine-readable dictionaries, parallel corpora, explicit ontology and category knowledge). In chapter 9, we also tackle the models from item (1), but contrary to the prior work, we will demonstrate how to build these feature sets without any parallel data or external bilingual lexicons to obtain shared cross-lingual features. In that chapter, we will bootstrap these shared features from an initial seed set of features obtained by an initial model of similarity built within our minimalist setting (thus effectively remaining within the same cheap minimalist setting).

6.2.3 Quick Notes on Terminology

Of Names and Naming Conventions. The term distributional models of semantic similarity which we predominantly use in this thesis is per se rather vague. However, the reader must be aware that the relevant literature lists other terms that essentially refer to the exact same concept, such as distributional


Of Similarity and Relatedness. Even the term semantic similarity is vague as it may in general denote similarities between documents, words/phrases or relations [299]. This thesis tackles the problem of attributional similarity of words [298], which comprises standard taxonomic semantic relations such as synonymy (a relation between words with the same or similar meanings, e.g., buy and purchase), hyponymy and hypernymy (a hyponym is a word in a type-of relation with its hypernym, e.g., pigeon is a hyponym of bird, which is in turn a hyponym of animal), co-hyponymy (e.g., seagull and crow are co-hyponyms of the shared hypernym bird), etc. [17]. Words like cat and kitten, for instance, are attributionally similar in the sense that their meanings share a large number of attributes: they are animals, they meow, they like to drink milk, etc. The attributional similarity investigated here is opposed to relational similarity, which refers to detecting properties and relations shared between pairs of words, e.g., cat-animal and car-vehicle. Moreover, the reader has to be aware that the concept of semantic similarity is more specific than semantic relatedness (although the two are sometimes used interchangeably), as relatedness includes concepts such as antonymy (a relation between two words with completely opposite meanings, e.g., war-peace) and meronymy (a relation where one word is a constituent part of another, e.g., finger-hand), while similarity does not [39].

Of Types and Tokens. A token is a single instance of a word symbol, whereas a type is a general class of tokens [186]. If we take the quote from Samuel Beckett at the beginning of this chapter as an example, we say that word types such as Ever or again are each counted once in the quote, while there are two tokens/instances of each of these word types occurring in the quote. The difference between type-based and token-based models of similarity is especially important for polysemous words (i.e., words that exhibit more than one meaning, such as plant, shed, bank or match), and we will refer extensively to that difference when building our context-sensitive models of cross-lingual similarity (chapter 10).
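The type/token distinction can be checked mechanically. The short Python sketch below counts both for a made-up sentence (a hypothetical example, not the quote referred to above), assuming simple whitespace tokenization.

```python
from collections import Counter

# Hypothetical example sentence, tokenized on whitespace.
text = "the cat saw the other cat"
tokens = text.split()          # every running word is one token
type_counts = Counter(tokens)  # each distinct word form is one type

print(len(tokens))        # number of tokens
print(len(type_counts))   # number of types
print(type_counts["cat"])  # tokens of the type "cat"
```

A type-based model would assign one representation to the type cat, whereas a token-based model could treat each of its two occurrences separately, depending on its context.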

6.3 Cross-Lingual Semantic Similarity via Latent
